saltstack / salt

Application-level Keepalive is mandatory for healthy connections #12540

Closed sivann closed 5 years ago

sivann commented 10 years ago

Hello,

The problem:

We now have about 100 salt-minions installed in remote areas over 3G and satellite connections.

We lose connectivity with all of those minions within about 1-2 days of installation, with test.ping reporting "minion did not return". Each time the state was the same: the minions saw an ESTABLISHED TCP connection, while on the salt-master no connections were listed at all. (Yes, that is correct.) Tighter keepalive settings were tried with no result. (The OS is Linux.) Each time, restarting the salt-minion fixes the problem immediately.

Obviously the connections are transparently proxied somewhere (who knows what happens inside those SAT networks), so the whole TCP keepalive mechanism of 0mq fails.

Suggestion:

Salt should handle this at the application level, so as to determine connection health and reconnect if needed, e.g. by sending dummy ping data periodically (say, every 10 minutes) and checking for a valid reply. The only workaround we can see is restarting salt-minion hourly, which is really ugly.
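
As it turns out, the rest of the thread converges on exactly this mechanism. On builds that support it, the knob is a single minion-config option (a sketch; the interval is in minutes):

ping_interval: 10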

UtahDave commented 10 years ago

What version of Salt are you running, and on which OSes?

Can you provide the output of salt 'minion id' test.versions_report?

sivann commented 10 years ago

All minions are on Debian Squeeze, salt-minion version 2014.1.1+ds-1~bpo60+1 (freshly installed), with keepalive counters counting normally (when executing ss -e or netstat -ean). The salt-master is 2014.1.3+ds-2trusty2 on Ubuntu 14.04.

test.versions_report won't work since they are all unreachable, but for one that I restarted manually I got this:

               Salt: 2014.1.1
             Python: 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
             Jinja2: 2.5.5
           M2Crypto: 0.20.1
     msgpack-python: 0.1.10
       msgpack-pure: Not Installed
           pycrypto: 2.1.0
             PyYAML: 3.09
              PyZMQ: 13.1.0
                ZMQ: 3.2.3

I also include a tcpdump from the minion. The master shows no connections, while the minion shows an established one.

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:59:37.751471 IP (tos 0x0, ttl 64, id 49407, offset 0, flags [DF], proto TCP (6), length 52)
    10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0x9449 (correct), ack 1897141646, win 92, options [nop,nop,TS val 4774885 ecr 1689822158], length 0

12:59:37.755784 IP (tos 0x0, ttl 63, id 58924, offset 0, flags [DF], proto TCP (6), length 52)
    xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0xc7a4 (correct), ack 1, win 46, options [nop,nop,TS val 1690129362 ecr 3537089], length 0

13:04:37.755293 IP (tos 0x0, ttl 64, id 49408, offset 0, flags [DF], proto TCP (6), length 52)
    10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xbf46 (correct), ack 1, win 92, options [nop,nop,TS val 4849886 ecr 1690129362], length 0

13:04:37.762560 IP (tos 0x0, ttl 63, id 58925, offset 0, flags [DF], proto TCP (6), length 52)
    xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x1798 (correct), ack 1, win 46, options [nop,nop,TS val 1690436570 ecr 3537089], length 0

13:09:37.759286 IP (tos 0x0, ttl 64, id 49409, offset 0, flags [DF], proto TCP (6), length 52)
    10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xea3f (correct), ack 1, win 92, options [nop,nop,TS val 4924887 ecr 1690436570], length 0

13:09:37.767481 IP (tos 0x0, ttl 63, id 58926, offset 0, flags [DF], proto TCP (6), length 52)
    xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x678f (correct), ack 1, win 46, options [nop,nop,TS val 1690743774 ecr 3537089], length 0

(xxxxxx.eu-west-1.compute.amazonaws.com is the master, 10.11.32.161 is the local IP of the minion)

basepi commented 10 years ago

I think the new RAET UDP transport will be the best answer for cases like these. We won't have to worry about TCP reporting whether the connection is alive or not, and it should handle latency a lot better.

basepi commented 10 years ago

(It will also give us a lot more application-side power and introspection into what's happening, so we can solve issues like this more easily. ZMQ tends to be a black box, which makes these types of problems much harder to debug.)

sivann commented 10 years ago

I believe changing the queue mechanism is not the best way to resolve this. UDP is not guaranteed to work in this case either. I urge you not to combine the two issues, since I feel that this will only delay a possible fix. I don't see why an application-level keepalive cannot be implemented using 0mq; this is not a 0mq bug.

basepi commented 10 years ago

Oh, I agree that UDP is not a fix-all. The advantage of this new implementation is that it brings the queuing mechanisms much closer to the application level, which will make it much easier to build application-level keepalive (which is a given, since UDP doesn't have its own keepalive).

We would love to have ZMQ application-level keepalive, and we have by no means written it off, but it will take extensive effort, so we're going to wait until RAET is out in the wild and see what the reception is. We may even end up building a TCP transport for RAET to completely replace ZMQ, at which point application-level keepalive would be a given there as well.

sivann commented 10 years ago

Do you have any time estimate for this release? Does this 0mq replacement also mean that all the clients will have to be updated manually? That would be a huge IT effort even for the few hundred salt-minions we have. There have been several cases in salt development that resulted in lost minions. Needing a second mechanism to access those hosts really diminishes salt's usefulness. Sorry for being a bit bitter; I really appreciate your efforts and salt itself, but I feel that breaking compatibility so often is really not the way to go.

basepi commented 10 years ago

0MQ is not being replaced. This new release will just introduce an alternate transport mechanism which must be explicitly enabled. Nothing will change unless you want it to. Even people who want to switch to RAET should be able to do so without having to install anything manually. All it will take is upgrading the master and minions, ensuring that the proper RAET dependencies are installed on all the systems, and then switching first the minions and then the master over to RAET in their config.

All of that said, it will be a beta product in this next release, so we won't recommend immediately switching over an entire infrastructure or anything.

To answer your original question, we are targeting 2 weeks from now for the first release candidate.
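
For what it's worth, the switch described above is expected to be a single option in both the master and minion configs once RAET ships (a sketch, assuming a RAET-capable build with its dependencies installed; per the rollout above, set it on the minions first, then on the master):

transport: raet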

UtahDave commented 10 years ago

@sivann, have you tried setting up the Salt master to run a test.ping on all your minions on a regular basis? Maybe once an hour, or once every 10 minutes? Salt has a scheduler that allows you to do that; see the sketch below. Some people have had success with that.
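
A minimal sketch of that scheduler idea, assuming the schedule syntax documented for this era (the job name connectivity_ping is made up). It is shown as a minion-side schedule entry, since the scheduled job's return has to cross the minion-to-master channel, so missing returns expose a dead connection:

schedule:
  connectivity_ping:
    function: test.ping
    minutes: 10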

sivann commented 10 years ago

@basepi thanks for clarifying that; 2 weeks is not too long. @UtahDave we thought of that, but in some cases we lose them even within 10 minutes. Another option is to restart salt-minion hourly from cron, but this seems like overkill.

I saw in the sources that salt-minion has a scheduler, but I couldn't find how to re-initialize the 0mq connection so as to write a simple keepalive patch (to just send a message to the server); it seems it is only initialized once, in the tune_in function, so it was not so simple for me to patch.

steverweber commented 10 years ago

Sorry, I never read the full thread; just reading through random issues.

If you are using the dev version of salt, perhaps something like this might help in the minion config:

master: none_mult_master_ip   # placeholder from the original comment: your (single, non-multi-master) master address
ping_interval: 2              # minion pings the master at this interval (see the units correction below)
auth_timeout: 10              # seconds to wait on each authentication attempt
auth_tries: 2                 # re-auth attempts before giving up
auth_safemode: False          # allow the restart-on-auth-failure path instead of safe mode
random_reauth_delay: 10       # randomize re-auth timing to avoid reconnect stampedes

steverweber commented 10 years ago

The above will have the minions 'ping' the master every 120 seconds... if that ping fails and the re-auth retries also fail, then the minion restarts.

sivann commented 10 years ago

Thanks @steverweber, this looks promising! I will try it.

steverweber commented 10 years ago

Please let us/me know in a week from now if anything can be improved. Thanks.

steverweber commented 10 years ago

Correction: ping_interval is in minutes, not seconds, so ping_interval: 2 means the minion pings the master every 2 minutes.

steverweber commented 10 years ago

@sivann The auto-restart code was patched: https://github.com/saltstack/salt/pull/13582

How are things going? Is the salt deployment more stable now?

sivann commented 10 years ago

This is not yet released as of 2014.1.7. We just installed today's devel from GitHub and will get back with results.

sivann commented 10 years ago

It seems it is not fixed; it's actually worse. The new code causes lots of stale ESTABLISHED connections on the master.

Minion IP: 10.11.40.161, public: 176.227.142.126. Salt-master IP: 10.0.0.212, public: 54.246.180.52.

Minion:
root@debian:/usr/local/bin# netstat -ean |grep 54.246.180.52
tcp        0      0 10.11.40.161:43693      54.246.180.52:4505      ESTABLISHED 0          125808  

root@debian:/usr/local/bin# ss -e|grep 54.246.180.52
ESTAB      0      0            10.11.40.161:43693        54.246.180.52:4505     timer:(keepalive,46sec,0) ino:125808 sk:f3512600

Master:
root@saltmaster:~ # netstat -ean |grep 176.227.142.126
tcp        0      0 10.0.0.212:4505         176.227.142.126:60215   ESTABLISHED 0          149365     
tcp        0      0 10.0.0.212:4506         176.227.142.126:53544   ESTABLISHED 0          156632     
tcp        0      0 10.0.0.212:4505         176.227.142.126:47687   ESTABLISHED 0          149367     
tcp        0      0 10.0.0.212:4505         176.227.142.126:40874   ESTABLISHED 0          149360     
tcp        0   1560 10.0.0.212:4505         176.227.142.126:37470   ESTABLISHED 0          149378     
tcp        0      0 10.0.0.212:4506         176.227.142.126:53513   ESTABLISHED 0          156639     
tcp        0      0 10.0.0.212:4505         176.227.142.126:54876   ESTABLISHED 0          149377     
tcp        0   1560 10.0.0.212:4505         176.227.142.126:43693   ESTABLISHED 0          150116     
tcp        0      0 10.0.0.212:4505         176.227.142.126:55295   ESTABLISHED 0          149362     
tcp        0      0 10.0.0.212:4505         176.227.142.126:39531   ESTABLISHED 0          149361     
tcp        0      0 10.0.0.212:4505         176.227.142.126:48655   ESTABLISHED 0          149363  

strace on the minion at that time (the logs show nothing useful):

[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 508535066}) = 0
[pid 14053] gettimeofday({1405065700, 720156}, NULL) = 0
[pid 14053] gettimeofday({1405065700, 720449}, NULL) = 0
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509479581}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509881187}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000
... and so on forever

The minion shows 1 ESTABLISHED connection with the master, the master shows 9, and none of them actually works. What probably happened is that the minion tried to reconnect to the master, leaving all those stale ESTABLISHED connections behind. For some reason the reconnection was unsuccessful, since the minion does not respond to salt commands from the master. Perhaps the master does not know which connection is the right one?

A suggestion: the master could also ping the minions on established connections, and close the stale ones.

steverweber commented 10 years ago

Ya, something was mucked up with that commit. I created a new fix that seems much more stable: https://github.com/saltstack/salt/pull/14064

I'll likely make only a small change to that before I give the go-ahead to merge. Testing is most welcome!

steverweber commented 10 years ago

To test this patch you can do:

curl -o install_salt.sh -L https://bootstrap.saltstack.com
sudo sh install_salt.sh -g https://github.com/steverweber/salt.git git fix_restarts

sivann commented 10 years ago

I installed the version above; it does not even connect to the master once. Something's wrong.

First, it starts 2 salt-minion processes, and test.ping never works. I include the debug logfile:

2014-07-14 10:07:57,527 [salt             ][INFO    ] Setting up the Salt Minion "battens-c1.insolar-plants.net"
2014-07-14 10:07:57,534 [salt.utils.process][DEBUG   ] Created pidfile: /var/run/salt-minion.pid
2014-07-14 10:07:57,537 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion
2014-07-14 10:07:57,799 [salt.config      ][DEBUG   ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:07:57,800 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:07:57,803 [salt.config      ][DEBUG   ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:07:57,803 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:07:57,806 [salt.config      ][DEBUG   ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:07:57,807 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:07:58,302 [salt.minion                              ][DEBUG   ] Attempting to authenticate with the Salt Master at 54.246.180.52
2014-07-14 10:07:58,305 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:02,290 [salt.crypt                               ][DEBUG   ] Decrypting the current master AES key
2014-07-14 10:08:02,291 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:04,158 [salt.minion                              ][INFO    ] Authentication with master successful!
2014-07-14 10:08:06,970 [salt.crypt                               ][DEBUG   ] Decrypting the current master AES key
2014-07-14 10:08:06,972 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:11,594 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:12,552 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion
2014-07-14 10:08:12,813 [salt.config                              ][DEBUG   ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:08:12,814 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:08:12,817 [salt.config                              ][DEBUG   ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:08:12,818 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:08:12,821 [salt.config                              ][DEBUG   ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:08:12,821 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:08:13,990 [salt.utils.schedule                         ][INFO    ] Added new job __mine_interval to scheduler
2014-07-14 10:08:13,991 [salt.minion                                 ][DEBUG   ] I am battens-c1.insolar-plants.net and I am not supposed to start any proxies. (Likely not a problem)
2014-07-14 10:08:13,991 [salt.minion                                 ][INFO    ] Minion is starting as user 'root'
2014-07-14 10:08:13,992 [salt.minion                                 ][DEBUG   ] Minion 'battens-c1.insolar-plants.net' trying to tune in
2014-07-14 10:08:13,994 [salt.minion                                 ][DEBUG   ] Minion PUB socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,995 [salt.minion                                 ][DEBUG   ] Minion PULL socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,995 [salt.minion                                 ][INFO    ] Starting pub socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,996 [salt.minion                                 ][INFO    ] Starting pull socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,997 [salt.minion                                 ][DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (2767)
2014-07-14 10:08:13,998 [salt.minion                                 ][DEBUG   ] Setting zmq_reconnect_ivl to '2767ms'
2014-07-14 10:08:13,999 [salt.minion                                 ][DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'

sivann commented 10 years ago

@steverweber does the master expect pings from the minions? If not, it will never forget the stale connections. I think the master must be aware of this "ping": if the master does not receive pings from a minion, it could close that minion's connections.

steverweber commented 10 years ago

The above log looks like the minion connects to the master at 54.246.180.52: "Authentication with master successful!"

The master does not "expect" pings but rather accepts them. The minion sends pings at the configured ping_interval, in minutes.

Here is an aggressive configuration I use on my minions for testing:

master: ddns.name.com
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10

sivann commented 10 years ago

Yes, I know the minion thought it was successfully connected, but it wasn't. It seems that restarting the minion the way it is done now somehow confuses the master. I could not issue a single successful command to the minion with the above version, even after multiple restarts. Reinstalling the "stock" minion version fixed this behaviour. Please tell me how to help debug further.

steverweber commented 10 years ago

I found an issue in daemon mode when running under a thread; however, I don't think this would cause the minion to not respond to the master.

I pushed a new fix to fix_restarts; it cleans up some issues.

sivann commented 10 years ago

It is strange, though, that there were multiple salt-minion processes running from the beginning. Even after killing them and restarting, there were again multiple salt-minion processes. Perhaps this manifests when the network between minion and master is slow, as in our case.

steverweber commented 10 years ago

You should see two processes.

Once some tricky issues are solved, this solution can become a single process.

https://github.com/saltstack/salt/pull/14236

steverweber commented 10 years ago

The current version is holding up well on my systems. However, I'm holding off on my pull request until this solution works in your environment. Are you testing the latest version (https://github.com/steverweber/salt/tree/fix_restarts) that was pushed 2 days ago? Is it working out?

sivann commented 10 years ago

I will test tomorrow.

steverweber commented 10 years ago

When testing this patch, please disable your custom tcp_keepalive_* settings (see below) and reboot the system.

Thanks.
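
For reference, these are the minion tcp_keepalive_* options being referred to; the values shown are the stock defaults as I understand them, not recommendations:

tcp_keepalive: True        # enable TCP keepalive on the minion's master connection
tcp_keepalive_idle: 300    # seconds of idle time before the first probe
tcp_keepalive_cnt: -1      # -1 = use the OS default number of probes
tcp_keepalive_intvl: -1    # -1 = use the OS default interval between probes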

steverweber commented 10 years ago

The keepalive patch has been merged into the develop branch.

sivann commented 10 years ago

Any

sivann commented 10 years ago

@steverweber sorry for the long delay; I'm ready to test again. Where can I find your latest code to test?

sivann commented 10 years ago

Ignore my last comment; I'm testing the latest dev branch.

steverweber commented 10 years ago

How are the little minions behaving?

sivann commented 10 years ago

I ran it on a minion that normally gets lost within a few hours, and with the dev version it has been responding to occasional pings for the past 8 days, sometimes on the 2nd ping. I would say that's very good news. I'll install it on 2-3 more minions soon. Thanks.

steverweber commented 10 years ago

@sivann can this issue be closed?

sivann commented 10 years ago

Will your patch get released? If yes, then yes, I consider it fixed. My minion still responds :-)

basepi commented 9 years ago

I think it's currently only in the develop branch, so that would make it slated for the feature release after 2014.7.

afletch commented 9 years ago

Seeing similar issues to this. Is there a plan for this to make it into a release?

steverweber commented 9 years ago

The patch /works/ but it's not elegant. Personally I would rather see the minion die hard and have the service manager (systemd, upstart, whatever you have) restart the minion: https://github.com/saltstack/salt/pull/22313

afletch commented 9 years ago

Does the keepalive patch (https://github.com/saltstack/salt/issues/12540#issuecomment-50223513) simply restart the minion? That was the patch I was referring to.

steverweber commented 9 years ago

It restarts the minion... but it's the minion restarting itself. You will see two minion processes running in ps: one that keeps the other one running. It was done this way because the salt code was not really built for rebuilding the minion object in the same process (arg parsing and global objects are tricky).

Looking back at the code, it would be simpler to update all the different service launchers (systemd, upstart, init.d, launchd, ...) to auto-restart if the minion dies: https://github.com/saltstack/salt/pull/22313

afletch commented 9 years ago

Surely the better approach would be to resolve the reason a restart is needed in the first place (the minion stops communicating with the master). init.d, for example, has no auto-restart ability and would need something like supervisord.

sivann commented 9 years ago

I agree; exiting the minion and relying on systemd/init/monit is just another source of technical issues: systemd timeouts, init muting the service, etc. The salt-minion should be robust enough to cope with a simple network reconnection.

sivann commented 9 years ago

Well, this is not completely fixed, although the ping does seem to work. I have:

ping_interval: 90
auth_tries: 20
rejected_retry: True
auth_safemode: False
restart_on_error: True

All commands always fail on the first try, and some on several subsequent tries. Not very reliable if you have thousands of minions. I'm looking forward to RAET so we can actually benefit from saltstack, because in its current state we can only use it for first-time configurations/installations.

basepi commented 9 years ago

You might also try ping_on_rotate: True in your master config, so that the master automatically sends a test.ping job after each AES key rotation. That solves some of the "slow to respond" issues for some users.
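
That is, a single line in the master config:

ping_on_rotate: True    # send test.ping to all connected minions after each key rotation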

vielmetti commented 9 years ago

I'm in a situation where VPN connections come and go, sometimes changing the IP address of the endpoint.

I think I can cope with this via changes to the OpenVPN config that restart the minion when the VPN comes up, but of course I'm similarly interested in this open issue.

steverweber commented 9 years ago

Also looking forward to RAET, because losing connections to minions is a painful experience.

sivann commented 9 years ago

@basepi thanks, I will try that.