str2str doesn't always reconnect to the caster when its ip address changes

Stefal commented 8 months ago

Hi!

Last night, the Centipede caster, which receives the streams from about 500 base stations, went offline (security problem in the datacenter). A backup server was started with another caster instance, and the DNS entry was updated.

Most base stations run RTKBase, which uses str2str. (A big, big thank you to @tomojitakasu, @rtklibexplorer et al. for this tool.)

Only half of the base stations reconnected to the new caster on their own. The others, including mine, are still trying to send the RTCM stream to the IP address that no longer works. str2str outputs messages like these:

-398818572 B   41263 bps (0) localhost (1) recv error (114)
-398522328 B   41003 bps (0) localhost (1) caster.centipede.fr
-398223208 B   41477 bps (0) localhost (1) recv error (114)
-397923884 B   42016 bps (0) localhost (1) recv error (114)
-397627172 B   41906 bps (0) localhost (1) connecting...
-397329568 B   40171 bps (0) localhost (1) recv error (114)
-397034524 B   38916 bps (0) localhost (1) recv error (114)
-396737696 B   39736 bps (0) localhost (1) recv error (114)

If I ping caster.centipede.fr from the base station, the correct IP address is returned, so the DNS propagation is OK.

Even stranger: there is a base station with 2 GNSS receivers (F9P + Mosaic X5) and 1 str2str instance per receiver, each sending its RTCM stream to a different mount point on the same caster. One str2str instance has switched to the new caster, but not the other one.

It's not the first time I have noticed this problem with str2str. As a workaround, I could write a tool that parses the str2str output and restarts it after too many 'recv error' messages, but I think it would be better to update str2str so that it handles this problem itself.
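The workaround mentioned above could be sketched roughly as follows. This is only an illustration in Python, not part of RTKLIB or RTKBase: the function name `should_restart`, the window of 10 status lines and the threshold of 6 errors are assumptions, and the regular expression is based solely on the status lines quoted above. A sliding window is used instead of a consecutive count because, in that excerpt, the errors are interleaved with "connecting..." and the caster name.

```python
import re
from collections import deque

# Pattern taken from the status lines quoted in this issue, e.g. "recv error (114)".
RECV_ERROR = re.compile(r"recv error \(\d+\)")


def should_restart(lines, window=10, threshold=6):
    """Return True once `threshold` of the last `window` lines report a recv error.

    `window` and `threshold` are hypothetical values chosen for illustration.
    """
    recent = deque(maxlen=window)  # True for each recent line that shows an error
    for line in lines:
        recent.append(bool(RECV_ERROR.search(line)))
        if sum(recent) >= threshold:
            return True
    return False
```

A wrapper would feed this function the status lines it reads from str2str and restart the process or service when it returns True. How the lines are captured depends on how str2str is launched.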

rtklibexplorer commented 8 months ago

I should get a chance to look into this sometime in the next week or two but I don't have much expertise in IP network communications so if anyone else has a chance to dig into this and either provide more information or even better, a pull request for a code fix, that would be very helpful.

Stefal commented 8 months ago

Just an idea: I could be wrong, but I have the feeling that the problem is more "global" than a specific network problem. It could be related to how str2str behaves when a stream malfunctions.

I've never created an issue about this because the problem could be outside of RTKLib, but from time to time, a str2str instance with serial input and local TCP output doesn't work correctly. In these cases, there is no str2str output on the terminal or in the log file created with the options -t 2 -fl log. (see https://github.com/Stefal/rtkbase/issues/333#issuecomment-1814515402)

I also see other issues related to input/output: https://github.com/rtklibexplorer/RTKLIB/issues/145 https://github.com/rtklibexplorer/RTKLIB/issues/152

alexmodesto73 commented 8 months ago

Hello,

By doing some tests, I realized that DNS calls are only made on two occasions:

* Service restart
* Failure of the flow to the caster.

If the TTL is 60 seconds, I would expect a DNS query to the host's resolver every 60 seconds.

Do you use system functions to make a DNS lookup every time a domain name is used?

To test, I deliberately break access to the caster like this: route add -host 82.64.252.223 gw 127.0.0.1

About 20 seconds after breaking access, I see a DNS query (see screenshot).

When I restore access to the caster's IP address:

route del -host 82.64.252.223 gw 127.0.0.1

the flow to the caster starts again, but I don't see a new DNS request.

Since the TTL is 60 seconds, I would expect a new DNS query each time the cached entry expires on the system.

GregoireW commented 8 months ago

> Hello,
>
> By doing some tests, I realized that DNS calls are only made on two occasions:
>
> * Service restart
> * Failure of the flow to the caster.
>
> If the TTL is 60 seconds, I would expect a DNS query to the host's resolver every 60 seconds.

As long as the socket is open, you do not need to open a new one, hence you do not need to resolve the name again.

alexmodesto73 commented 8 months ago

This is indeed the behavior I observe, but what is the harm in redoing a DNS request each time the TTL expires?

The flow will not be cut, since in 99.9999% of cases the IP will be the same.

However, it is the breakdown, the special case, such as the loss of remote administration, that can force us to abandon the "MASTER" caster.

To do this, you must be able to "notify" all clients of the change of IP address in the DNS without having to cut the flow. To be clear, we may have a scenario where it is impossible to cut the flow (loss of access to server administration).

This can create a "split brain" (term used in DRBD), and we end up with clients that never see the new IP address and stay on the old MASTER.

Is that clearer?
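The periodic re-resolution argued for above could look roughly like the Python sketch below. It is an illustration only and says nothing about how RTKLIB is written: the class name `DnsWatcher`, the 60 second interval and the injectable `resolver` and `clock` parameters are all assumptions. The stream keeps running on the existing socket, and the name is looked up again only when the interval has elapsed. A failed lookup is ignored so that a DNS outage does not disturb a working stream.

```python
import socket
import time


def resolve_ips(host, port):
    """Ask the operating system resolver for the current addresses of `host`."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


class DnsWatcher:
    """Hypothetical helper: re-resolve a name at most once per interval."""

    def __init__(self, host, port, interval=60.0,
                 resolver=resolve_ips, clock=time.monotonic):
        self.host, self.port = host, port
        self.interval = interval      # assumed to mirror the DNS TTL
        self.resolver = resolver
        self.clock = clock
        self.next_check = clock() + interval

    def peer_is_stale(self, connected_ip):
        """Return True if a due lookup no longer lists `connected_ip`."""
        now = self.clock()
        if now < self.next_check:
            return False              # lookup not due yet, keep streaming
        self.next_check = now + self.interval
        try:
            current = self.resolver(self.host, self.port)
        except OSError:
            return False              # DNS unavailable: leave the stream alone
        return connected_ip not in current
```

The caller would check `peer_is_stale()` from its send loop and reconnect only when it returns True, so the flow is not interrupted in the common case where the address is unchanged.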

GregoireW commented 8 months ago

A good rule of thumb is to never do anything that is not needed. It will cause problems one way or another.

Don't forget that if you think of a scenario where you have lost your server AND the network layer (FW, LB, WAF if any, ...), you should assume you have also lost your DNS.

rtklibexplorer commented 8 months ago

I am not very familiar with this part of the RTKLIB code (I've focused more on the GNSS algorithm side of things), but as far as I can tell, RTKLIB is not explicitly trying to resolve the DNS address itself and is relying on the calls to the operating system to do this. This might explain why some systems are able to resolve the change and others are not. I am open to implementing a solution if anyone has a specific suggestion but otherwise I don't believe I can resolve this on my own.

simeononsecurity commented 6 months ago

I think the easiest way to handle this is to try to resolve via DNS again when the connection drops. Note the available IPs and compare them to the previously connected IP. If one is available that isn't the one you were on, fail over to that new IP.
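A minimal Python sketch of that comparison step, under stated assumptions: `failover_candidates` is a hypothetical name, and the list of resolved addresses is taken as input rather than looked up here. Addresses other than the one that just failed are tried first, and the previous address is kept as a last resort in case the drop was transient.

```python
def failover_candidates(previous_ip, resolved_ips):
    """Order addresses so that anything other than the failed one is tried first."""
    # dict.fromkeys removes duplicates while keeping the resolver's order.
    fresh = list(dict.fromkeys(ip for ip in resolved_ips if ip != previous_ip))
    # Keep the previous address at the end only if DNS still lists it.
    fallback = [previous_ip] if previous_ip in resolved_ips else []
    return fresh + fallback
```

After a drop, the client would resolve the name again, pass the result to this function and attempt the returned addresses in order.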

I think it would also be smart, on initial connection, to add some sort of latency check for all the IPs that a DNS name resolves to, and to choose the one with the lowest latency by default. Otherwise, you've basically only got random and round-robin as alternatives for selecting IPs. The latency check would also be good for determining whether a server is down. But not all hosts support this, so it would need an option to disable it as well.
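One way to picture the latency check is the Python sketch below. It is an assumption-laden illustration: the time needed to open a TCP connection is used as a stand-in for latency (no ICMP ping is involved), and the names `tcp_connect_latency` and `fastest_ip` as well as the 2 second timeout are hypothetical. Addresses that cannot be reached are skipped, which also covers the "is the server down" case.

```python
import socket
import time


def tcp_connect_latency(ip, port, timeout=2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None  # unreachable, refused or timed out


def fastest_ip(ips, port, probe=tcp_connect_latency):
    """Return the reachable address with the lowest probe time, or None."""
    timings = {ip: probe(ip, port) for ip in ips}
    reachable = {ip: t for ip, t in timings.items() if t is not None}
    if not reachable:
        return None
    return min(reachable, key=reachable.get)
```

The `probe` parameter makes it easy to swap in another measurement or to bypass the check entirely when a host does not support it.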

GregoireW commented 6 months ago

I tracked the issue down a little further. When I add latency/packet errors between str2str and the caster, it starts to show something similar to what we had in December.

I think the main cause is here: https://github.com/rtklibexplorer/RTKLIB/blob/d0b599365ece2af8a22c10137beb2c8bed99b8ea/src/stream.c#L1088-L1089 the select may always report that the socket is busy (0), so it never tries to send data. At the same time, the return value of 0 does not trigger a failure, so the socket is never terminated.

Now, I do not know exactly what happened in December. Were some of the select calls OK? I do not know. Do some select calls fail on an ordinary day? I do not know. So what would be the trigger to decide to kill the connection there? X failures in a row? A failure rate over a short duration?
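Both candidate triggers raised above can be combined in one small detector. The Python sketch below is only meant to make the two options concrete. It is not RTKLIB code, the name `StallDetector` is hypothetical, and every threshold (10 failures in a row, a 50% failure rate over 30 seconds, at least 10 samples) is a placeholder that would need tuning against real traffic.

```python
import time
from collections import deque


class StallDetector:
    """Decide when repeated "socket busy" results should close the connection."""

    def __init__(self, max_consecutive=10, max_rate=0.5, window=30.0,
                 min_samples=10, clock=time.monotonic):
        self.max_consecutive = max_consecutive  # trigger 1: X failures in a row
        self.max_rate = max_rate                # trigger 2: failure rate ...
        self.window = window                    # ... over this many seconds
        self.min_samples = min_samples          # avoid judging on too few samples
        self.clock = clock
        self.consecutive = 0
        self.samples = deque()                  # (timestamp, ok) pairs

    def record(self, ok):
        """Record one attempt; return True if the connection should be closed."""
        now = self.clock()
        self.samples.append((now, ok))
        # Forget samples that are older than the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        self.consecutive = 0 if ok else self.consecutive + 1
        if self.consecutive >= self.max_consecutive:
            return True
        failures = sum(1 for _, good in self.samples if not good)
        return (len(self.samples) >= self.min_samples
                and failures / len(self.samples) > self.max_rate)
```

Calling `record(False)` each time the socket is reported busy and `record(True)` after each successful send would let the caller close and reopen the connection as soon as either trigger fires.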

rtklibexplorer / RTKLIB, issue #166