saltstack / salt

Application-level Keepalive is mandatory for healthy connections #12540

Closed sivann closed 5 years ago

sivann commented 10 years ago

Hello,

The problem:

We now have about 100 salt-minions installed in remote areas over 3G and satellite connections.

We lose connectivity with all of those minions within about 1-2 days after installation, with test.ping reporting "minion did not return". Each time, the state was that the minion saw an ESTABLISHED TCP connection, while on the salt-master there was no connection listed at all (yes, that is correct). Tighter keepalive settings were tried with no result (the OS is Linux). Each time, restarting salt-minion fixes the problem immediately.

Obviously the connections are transparently proxied somewhere (who knows what happens inside those SAT networks), so the whole TCP keepalive mechanism of 0MQ fails.

Suggestion:

Salt should handle this at the application level, so it can determine connection health and reconnect if needed, e.g. by sending dummy ping data periodically (say every 10 minutes) and checking for a valid reply. The only workaround we can see is restarting salt-minion hourly, which is really ugly.
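
For context, the "tighter keepalive settings" mentioned above are the minion-side TCP keepalive knobs. A sketch of that part of the minion config follows; the option names exist in the Salt minion configuration, but the values here are purely illustrative, and as noted they did not help on these proxied links:

# /etc/salt/minion (sketch) -- TCP keepalive tuning for the master connections
tcp_keepalive: True        # enable TCP keepalives on the minion's sockets
tcp_keepalive_idle: 60     # seconds of idle time before the first probe
tcp_keepalive_intvl: 30    # seconds between probes
tcp_keepalive_cnt: 5       # unanswered probes before the kernel drops the connection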

steverweber commented 9 years ago

@sivann the kernel option net.ipv4.tcp_mtu_probing might be helpful on some minions.
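
One way to roll that sysctl out with Salt itself is a small state using the sysctl.present state module; a minimal sketch (the state file name is hypothetical):

# mtu_probing.sls (hypothetical) -- enable packetization-layer path MTU
# discovery on links that mangle or drop ICMP
net.ipv4.tcp_mtu_probing:
  sysctl.present:
    - value: 1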

bymodude commented 9 years ago

having exactly the same issues as @sivann

I tried all the workarounds described here and in various other issues. I only have ~40 minions, all in different clouds: Azure, AWS, etc. The master is in the Azure cloud; there is a firewall in place that doesn't allow any ICMP (shouldn't matter, but just mentioning it). The minions lose their connection frequently, and Salt is absolutely unusable because every time I want to do something I have to go and restart salt-minion on all minions that have lost their connection; after a few days maybe 25% of my minions are still connected.

I have tried to use RAET a few times over the last months; every time, the codebase had serious issues with RAET support that broke it completely (see the various open RAET support issues).

At the end of the day, this simple issue makes Salt completely unusable for any "remote configuration management" scenario. I am seriously considering switching to Ansible, even though I would have to rewrite a huge number of states as playbooks. I love Salt when it works, but this simple issue of minions losing their connection is so frustrating and so unacceptable that none of that matters.

Sorry for this rant, but there are many issues about minions losing their connection, and I don't understand how anyone is actually using Salt over a WAN successfully. I have tried so many different servers, clouds, and configurations, and it happens every time; the thing is that a TCP connection is simply not guaranteed to stay up forever. Why is this not being handled at the application level, as suggested in this issue?

Seeing as this issue has been open for over a year, does the Salt team have any intention of fixing it (or of providing a working RAET implementation), or should I cut my losses and move to another solution?

basepi commented 9 years ago

@bymodude Sorry you're having so many problems!

It's interesting because for every user who has a ton of problems with salt connections, we have hundreds of other users who are using it successfully. Not sure what's going on in your case! Have you tried hosting your master on another cloud to see if maybe there is something going on specifically with Azure?

Sorry for the RAET issues you've had. We've been focusing our efforts on a TCP transport to replace ZMQ, which should give us more visibility into what's going on when things fail, and that will in turn allow us to implement application-level keepalive much more easily. It will also let us keep the paradigm of only having to open ports on the master, instead of on all the minions, which is a pretty big "gotcha" when it comes to RAET (or UDP in general).

You might also consider checking out salt-ssh. Although the use of ssh as the transport slows salt down considerably, most state runs will work out of the box with salt-ssh.
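
For anyone going that route: salt-ssh drives hosts from a roster file instead of a minion connection. A minimal /etc/salt/roster sketch (host names, addresses, and users are made up):

# /etc/salt/roster (sketch) -- targets reachable over plain SSH
web1:
  host: 192.0.2.10
  user: root
db1:
  host: 192.0.2.11
  user: admin
  sudo: True        # escalate with sudo for state runs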

bymodude commented 9 years ago

@basepi thanks for your prompt response

We have tried hosting the master elsewhere, which reduces the frequency of disconnected minions, but it still happens. Playing with the TCP keepalive kernel parameters also influences this: e.g. a keepalive of 1200 seconds leads to lots of down minions after about 12 hours, while a keepalive of 75 seconds leaves most minions still up after a few days, though some eventually start dropping after days or weeks.

These disconnects have always happened since we started using Salt (we started around the 2014.7 release). However, having just tested 2015.5.3 and specifically tried to reproduce the disconnecting-minion issue (by raising our TCP keepalive to 1200 seconds, which seems to trigger the issue within ~12 hours, and by having one minion<->master connection that experiences packet loss, which triggers the disconnect even more easily), we are now actually seeing some errors on the minion side. I have opened #26888 and also left a related comment in #25288 (we were trying to use a manual workaround, as suggested here: http://comments.gmane.org/gmane.comp.sysutils.salt.user/22264 ).

From my understanding, the salt-minion now actually has the mechanisms to detect these lost connections; at least we are seeing SaltReqTimeoutErrors on the minion side. The minion just does not seem to restart in order to reconnect.

We are using salt-ssh to bootstrap our minions; however, using it for all states is not an option. As for "most state runs will work out of the box", at least in 2014.7 we found this not to be the case. Since we do have a roster containing all minions from the salt-ssh bootstrapping, a feasible workaround for us may be to set up a cron job that runs "salt '*' test.ping" every few minutes and then runs "salt-ssh dead_minions cmd.run 'service salt-minion restart'" against the minions that failed to answer, as sketched below. It just feels clunky.
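
A rough sketch of that workaround as a Salt state applied to the master itself; the script path is hypothetical, standing in for a script that runs "salt '*' test.ping", collects the minions that fail to answer, and restarts them via salt-ssh:

# restart_dead_minions.sls (hypothetical sketch)
restart-dead-minions:
  cron.present:
    - name: /usr/local/bin/restart_dead_minions.sh   # hypothetical detect-and-restart script
    - user: root
    - minute: '*/10'                                 # run every 10 minutes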

bymodude commented 9 years ago

Just wanted to add, regarding Azure, for the benefit of others who run Salt with minions or the master in the Azure cloud: the default TCP idle timeout on Azure is 4 minutes, so setting the TCP keepalive below that is essential, or alternatively increase the timeout on the Azure side as documented here: http://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/
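
In other words, the minion's keepalive idle time has to stay under that 240-second window. A sketch of the relevant minion-config lines (values illustrative):

# /etc/salt/minion (sketch) -- send the first probe well before Azure's
# 4-minute idle timeout tears the connection down
tcp_keepalive: True
tcp_keepalive_idle: 180    # 3 minutes, i.e. below the 240-second idle limit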

basepi commented 9 years ago

I will note that salt-ssh saw a lot of improvements in 2015.5; you might consider testing there. I'd be interested to hear about any gaps in feature compatibility -- we want to give salt-ssh feature parity with normal salt, if not the same performance. (Mine and publish calls are particularly bad in the performance area right now.)

steverweber commented 9 years ago

@bymodude using something like this in the minion config should cause the minion to restart if it hits 2 SaltReqTimeoutErrors in a row:

ping_interval: 2           # fire a ping at the master every 2 minutes
auth_timeout: 10           # wait at most 10 seconds for each auth reply
auth_tries: 2              # give up after 2 consecutive timeouts
auth_safemode: False       # allow the minion to restart itself when the ping fails
random_reauth_delay: 10    # stagger re-authentication by up to 10 seconds

I haven't tested it for some time now because I moved my master to a stable server with a static IP... At one time I was using a laptop with a dynamic IP address, and this worked while moving between home and work :-)

bymodude commented 9 years ago

@steverweber thanks for the suggestion, but I am already using that config and the minions still do not restart even though SaltReqTimeoutErrors show up in the log, as documented in #26888

sivann commented 8 years ago

This issue is still open. The ping patch improves things a lot, but in the long run we lose most minions. Of our now 830 minions, about 80% do not reply to test.ping, all upgraded to at least version 2015.8.8. Nothing ever shows up in the logs.

It is not apparent whether it is only networking or whether other bugs (auth issues?) are manifesting quietly, but in its current state we unfortunately cannot consider Salt for configuration management or orchestration; we only use it for initial installation. We have created more than 100 Salt states and we really would like to see Salt working for us.

I would be happy to provide more information but I'm not sure how I could help debug.

The only suggestion I can offer is to drop both of your custom low-level transport implementations, including RAET, and use a well-proven mechanism like MQTT, with an MQTT broker on the salt-master. It would clean up Salt's codebase, make protocol debugging much easier, and solve these issues. It would also open the door for thinner minion implementations, e.g. for embedded devices.

sivann commented 8 years ago

Example logs from a minion that no longer responds.

There is a _schedule.conf, with:

schedule:
  __mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60}
2016-05-16 09:34:03,620 [salt.utils.event                                     ][DEBUG   ][7013] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pub.ipc
2016-05-16 09:34:03,621 [salt.utils.event                                     ][DEBUG   ][7013] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pull.ipc
2016-05-16 09:34:03,624 [salt.utils.event                                     ][DEBUG   ][7013] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'ermine-c0.insolar-plants.net', '_stamp': '2016-05-16T08:34:03.623764'}
2016-05-16 09:34:03,626 [salt.minion                                          ][DEBUG   ][1995] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbcermine-c0.insolar-plants.net\xa6_stamp\xba2016-05-16T08:34:03.623764'
2016-05-16 09:34:03,628 [salt.transport.zeromq                                ][DEBUG   ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 09:34:03,629 [salt.crypt                                           ][DEBUG   ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 09:34:04,127 [salt.transport.zeromq                                ][DEBUG   ][7013] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 09:34:04,128 [salt.crypt                                           ][DEBUG   ][7013] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 09:34:06,338 [salt.utils.schedule                                  ][DEBUG   ][7013] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160516093403600494
2016-05-16 10:04:00,595 [salt.transport.zeromq                                ][DEBUG   ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:04:00,596 [salt.crypt                                           ][DEBUG   ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:03,594 [salt.utils.schedule                                  ][INFO    ][1995] Running scheduled job: __mine_interval
2016-05-16 10:34:03,595 [salt.utils.schedule                                  ][DEBUG   ][1995] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-05-16 10:34:03,596 [salt.utils.schedule                                  ][DEBUG   ][1995] schedule: This job was scheduled with a max number of 2
2016-05-16 10:34:03,614 [salt.utils.schedule                                  ][DEBUG   ][7061] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20160516103403600217', 'pid': 7061, 'id': 'ermine-c0.insolar-plants.net', 'schedule': '__mine_interval'}
2016-05-16 10:34:03,620 [salt.utils.event                                     ][DEBUG   ][7061] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pub.ipc
2016-05-16 10:34:03,621 [salt.utils.event                                     ][DEBUG   ][7061] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pull.ipc
2016-05-16 10:34:03,624 [salt.utils.event                                     ][DEBUG   ][7061] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'ermine-c0.insolar-plants.net', '_stamp': '2016-05-16T09:34:03.623898'}
2016-05-16 10:34:03,627 [salt.minion                                          ][DEBUG   ][1995] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbcermine-c0.insolar-plants.net\xa6_stamp\xba2016-05-16T09:34:03.623898'
2016-05-16 10:34:03,630 [salt.transport.zeromq                                ][DEBUG   ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:34:03,632 [salt.crypt                                           ][DEBUG   ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:04,128 [salt.transport.zeromq                                ][DEBUG   ][7061] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:34:04,129 [salt.crypt                                           ][DEBUG   ][7061] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:06,551 [salt.utils.schedule                                  ][DEBUG   ][7061] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160516103403600217

basepi commented 8 years ago

@cachedout Pinging you so we have a current core dev in the conversation.

cachedout commented 8 years ago

@sivann Hi there.

I'm not sure I would agree that the ZeroMQ transport is a "custom low-level implementation". It's a series of fairly well-defined patterns built on PyZMQ, using hooks provided by that library. Perhaps you just meant RAET and TCP? If so, I would agree there.

I'm aware of MQTT and one of the reasons that the salt transport system has been designed in a pluggable manner is to allow easy exploration into the feasibility of adding that support. It's not something that's on the roadmap at present.

Regarding the issues you've been facing, we'd have to debug this as one would debug any other networking problem. We'd likely need to look at packet captures of failed minions and we'd need to know the state of the sockets on the minion side and the master. You didn't say whether this was on the publish side or the return side, so that's the first determination that we'd need to make...

sivann commented 8 years ago

@cachedout, thank you for your help; I will try to get packet captures.

sivann commented 7 years ago

We also tried the new TCP transport: same behaviour, the minion was lost after 3 days. The minion's side shows ESTABLISHED, there is nothing in the logs, and the master's side shows no connection. As I stated before, TCP keepalives are not trustworthy; they are terminated locally in some networks.

@cachedout There is nothing to capture; stracing the minion shows it just polling its sockets:

[pid 29931] gettimeofday({1488274468, 901392}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677798, 684207835}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 995) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677799, 681042172}) = 0
[pid 29931] gettimeofday({1488274469, 898725}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 899251}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
[pid 29931] gettimeofday({1488274469, 899689}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900043}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900350}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900561}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677799, 683313820}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 996) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677800, 681053958}) = 0
[pid 29931] gettimeofday({1488274470, 898734}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 899271}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
[pid 29931] gettimeofday({1488274470, 899710}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900071}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900379}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900598}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677800, 683347726}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 995) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677801, 680110054}) = 0
[pid 29931] gettimeofday({1488274471, 898951}, NULL) = 0
[pid 29931] gettimeofday({1488274471, 900609}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0

Please do not implement more access methods; please just fix one of the current implementations. If you don't use your own keepalive payload, this will always be broken. Would a support contract help you move this forward? We have invested too much time in Salt and we are very sorry to see this basic functionality broken.

PeterS242 commented 7 years ago

@sivann, I am seeing/troubleshooting this as well on my network. We have a DMZ that has been closing connections from our non-DMZ Master to the DMZ Minions.

When you strace the Minion and it shows that it "just polls its socket", what do you see in the logs on the Minion? Are you running the option "log_level_logfile: all" or equivalent? Or, what are the last actions logged by the Minion? Also, what does the following command give you?

ss -nao | grep -P ':450(5|6)\s+'

In my case, I believe that our DMZ firewall is ignoring the TCP keepalives that pass through it, and since we do not have a scheduled master->minion ping (yet), the connection gets destroyed by the firewall after about 60 minutes due to what it considers inactivity. I am still troubleshooting, however.

PeterS242 commented 7 years ago

Here is an update.

On my non-firewalled network I have minions that still report timeouts, but recover adequately.

2017-03-05 13:20:12,920 [salt.transport.zeromq                                ][DEBUG   ][27814] SaltReqTimeoutError, retrying. (1/7)
2017-03-05 13:20:17,920 [salt.transport.zeromq                                ][DEBUG   ][27814] SaltReqTimeoutError, retrying. (2/7)
2017-03-05 13:20:22,920 [salt.transport.zeromq                                ][DEBUG   ][27814] SaltReqTimeoutError, retrying. (3/7)

With this in mind I decided to configure my DMZ-firewalled systems to have an increased "auth_tries" value of 20 instead of the default of 7. This seems to have done the trick for me. I could not get the "Set timeouts really low to force the code to restart the salt-minion child" method to work--I always wound up with my DMZ-firewalled systems running only the parent when it attempted, and failed, to restart the child.
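
For reference, the only change needed on those DMZ minions was a single line in the minion config (a sketch; 20 is simply the value that worked here, the default being 7):

# /etc/salt/minion -- give slow or firewalled links a bigger retry budget
# before the request channel gives up
auth_tries: 20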

Here is what my DMZ-firewalled systems report in their logs, for timeouts:

2017-03-05 13:20:18,155 [salt.transport.zeromq                                ][DEBUG   ][2841] SaltReqTimeoutError, retrying. (1/20)
2017-03-05 13:20:28,157 [salt.transport.zeromq                                ][DEBUG   ][2841] SaltReqTimeoutError, retrying. (2/20)
2017-03-05 13:20:38,158 [salt.transport.zeromq                                ][DEBUG   ][2841] SaltReqTimeoutError, retrying. (3/20)

Hope that helps someone.

sivann commented 7 years ago

@PeterS242 Hi Peter, as I stated above, the keepalive counters only count on one end (the minion's), because on SAT networks the keepalive is proxied (emulated). There is nothing in the minion logs.

sivann commented 7 years ago

There is some good news: testing the TCP transport with the latest 2016.11.x, we have now had 2 complete weeks during which all 3 test minions stayed reachable.

emeryray02 commented 7 years ago

At Link Labs we are facing a similar issue with minions on 3G and satcom connections. We have the minion running under supervisor, which helps some, but it would be very helpful to get a fuller picture of which minion/master settings you used to achieve your reliability. @sivann

riq-dfaubion commented 7 years ago

I'm also for application-level keepalives. 👍

I also second @emeryray02's request for more details, @sivann.

I have a master in AWS and minions in the field. After several hours, the minions lose connectivity to the salt-master, but they do not realize it. I have configured reasonable TCP keepalives, but AWS does not seem to be respecting them.

nuharaf commented 6 years ago

ZeroMQ 4.2 includes heartbeat functionality; maybe that could be used.

sivann commented 6 years ago

@riq-dfaubion we used the TCP transport as an experiment, i.e. "transport: tcp" rather than 0MQ; that seemed to work in a short test. But we have not found the time to switch over, as we have > 1000 minions, and there are several issues that prevent me from trusting Salt for orchestration (several broken releases, version incompatibilities, short-term OS support, half-baked LDAP). We just use it for commissioning now.
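
For those asking for details: the switch itself is just a config change, but it has to be made on both sides, since the master and its minions must use the same transport. A minimal sketch:

# goes in both /etc/salt/master and /etc/salt/minion (sketch);
# restart both daemons after changing it
transport: tcp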

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.