sivann closed this issue 5 years ago.
What version of salt are you running and on which OSes?
Can you provide the output of salt 'minion id' test.versions_report? On May 5, 2014 6:49 AM, "Spiros Ioannou" notifications@github.com wrote:
Hello, The problem:
We now have about 100 salt-minions installed in remote areas over 3G and satellite connections.
We lose connectivity with all of those minions within about 1-2 days of installation, with test.ping reporting "minion did not return". Each time, the minions saw an ESTABLISHED TCP connection, while on the salt-master there were no connections listed at all. (Yes, that is correct.) Tighter keepalive settings were tried with no result. (OS is Linux.) Each time, restarting the salt-minion fixes the problem immediately.
Obviously the connections are transparently proxied someplace (who knows what happens on those SAT networks), so the whole TCP-keepalive mechanism of 0mq fails.
Salt should handle this at the application level, so as to determine connection health and reconnect if needed, e.g. by sending dummy ping data periodically (every 10 minutes, say) and checking for a reply.
— Reply to this email directly or view it on GitHubhttps://github.com/saltstack/salt/issues/12540 .
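The kind of application-level keepalive being suggested can be sketched roughly like this (hypothetical names such as `send_ping` and `reconnect`; this is not Salt's actual API, just an illustration of the logic):

```python
import time


class KeepaliveMonitor:
    """Track ping results and decide when a reconnect is needed.

    TCP can keep reporting ESTABLISHED even when the path through a
    3G/SAT proxy is dead, so liveness is judged purely from replies.
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, ping_ok):
        """Record one ping result; return True if we should reconnect now."""
        if ping_ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0  # reset after signalling a reconnect
            return True
        return False


def keepalive_loop(send_ping, reconnect, monitor, interval=600):
    """Ping the master every `interval` seconds; reconnect on repeated failure."""
    while True:
        if monitor.record(send_ping()):
            reconnect()
        time.sleep(interval)
```

The point is that liveness is decided by replies at the application layer, not by the kernel's TCP state.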
All minions are on Debian Squeeze, salt-minion version 2014.1.1+ds-1~bpo60+1 (freshly installed), with keepalive counters counting normally (when executing ss -e or netstat -ean). The salt-master is 2014.1.3+ds-2trusty2 on Ubuntu 14.04.
test.versions_report won't work since they are all unreachable, but for one minion that I restarted manually I got this:
Salt: 2014.1.1
Python: 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
Jinja2: 2.5.5
M2Crypto: 0.20.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.1.0
PyYAML: 3.09
PyZMQ: 13.1.0
ZMQ: 3.2.3
I also include a tcpdump from the minion. The master shows no connections, while the minion shows established.
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:59:37.751471 IP (tos 0x0, ttl 64, id 49407, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0x9449 (correct), ack 1897141646, win 92, options [nop,nop,TS val 4774885 ecr 1689822158], length 0
12:59:37.755784 IP (tos 0x0, ttl 63, id 58924, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0xc7a4 (correct), ack 1, win 46, options [nop,nop,TS val 1690129362 ecr 3537089], length 0
13:04:37.755293 IP (tos 0x0, ttl 64, id 49408, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xbf46 (correct), ack 1, win 92, options [nop,nop,TS val 4849886 ecr 1690129362], length 0
13:04:37.762560 IP (tos 0x0, ttl 63, id 58925, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x1798 (correct), ack 1, win 46, options [nop,nop,TS val 1690436570 ecr 3537089], length 0
13:09:37.759286 IP (tos 0x0, ttl 64, id 49409, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xea3f (correct), ack 1, win 92, options [nop,nop,TS val 4924887 ecr 1690436570], length 0
13:09:37.767481 IP (tos 0x0, ttl 63, id 58926, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x678f (correct), ack 1, win 46, options [nop,nop,TS val 1690743774 ecr 3537089], length 0
(xxxxxx.eu-west-1.compute.amazonaws.com is the master; 10.11.32.161 is the local IP of the minion)
I think the new RAET UDP transport will be the best answer for cases like these. We won't have to worry about TCP (mis)reporting whether the connection is alive, and it should handle latency a lot better.
(It will also give us a lot more application-side power and introspection into what's happening, so we can solve issues like this more easily. ZMQ tends to be a black box, which makes these types of problems much harder to debug.)
I believe changing the queue mechanism to resolve this is not the best answer. UDP is not guaranteed to work in this case either. I urge you not to combine the two issues, since I feel that will only delay a possible fix. I don't see why an application-level keepalive cannot be implemented using 0mq; this is not a 0mq bug.
Oh, I agree that UDP is not a fix-all. The advantage of this new implementation is that it brings the queuing mechanisms much closer to the application level, which will allow us to much easier build application-level keepalive (which is a given, since UDP doesn't have its own keepalive).
We would love to have ZMQ application-level keepalive, and we have by no means written it off, but it will take extensive effort and we're going to wait until RAET is out in the wild and see what the reception is. We may even end up building a TCP transport for RAET to completely replace ZMQ, at which point application-level keepalive would be a given there as well.
Do you have any time estimate for this release? Does this 0mq replacement also mean that all the clients will have to be updated manually? That would be a huge IT effort even for the few hundreds of salt minions we have. There have been several cases in salt development that resulted in lost minions, and having to use a second mechanism to access those hosts really diminishes salt's usefulness. Sorry for being a bit bitter; I really appreciate your efforts and salt itself, but I feel that breaking compatibility so often is really not the way to go.
0MQ is not being replaced. This new release will just introduce an alternate transport mechanism which must be enabled to be in use. Nothing will change unless you want it to. Even people who want to switch to RAET should be able to do so without having to manually install. All it will take is upgrading the master and minions, ensuring that the proper RAET dependencies are installed on all the systems, and then switching first the minions and then the master over to RAET in their config.
All of that said, it will be a beta product in this next release, so we won't recommend immediately switching over an entire infrastructure or anything.
To answer your original question, we are targeting 2 weeks from now for the first release candidate.
@sivann, have you tried setting up the Salt master to run a test.ping on all your minions on a regular basis? Maybe once an hour or once every 10 minutes? Salt has a scheduler that allows you to do that. Some people have had success doing that.
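One way to set that up, sketched here as a plain cron entry on the master (paths illustrative; the built-in scheduler can do the same with a `schedule:` block in the master config, see the scheduler docs for the exact option names):

```
# /etc/cron.d/salt-keepalive -- ping all minions every 10 minutes
*/10 * * * * root /usr/bin/salt '*' test.ping > /dev/null 2>&1
```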
@basepi thanks for clarifying that, 2 weeks is not too long. @UtahDave we thought of that, but in some cases we lose them even within 10 minutes. Another option is to restart salt-minion hourly from cron, but this seems like overkill.
I saw in the sources that salt-minion has a scheduler but couldn't find how to re-initialize the 0mq connection, so as to write a simple keepalive patch (to just send a message to the server); it seems it is only initialized once in the tune_in function, so it was not so simple for me to patch.
Sorry, I never read the full thread; just reading through random issues.
If using the dev version of salt, perhaps something like this might help in the minion config:
master: none_mult_master_ip
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
The above will have the minions 'ping' the master every 120 seconds; if that ping fails and 5 re-auth retries fail, then the minion restarts.
Thanks @steverweber this looks promising! I will try it.
Please let us/me know in a week from now if anything can be improved. Thanks.
Correction: ping_interval is in minutes, so with the value above the minion pings the master every 2 minutes.
@sivann The auto restart code was patched. https://github.com/saltstack/salt/pull/13582
How are things going? Is the salt deployment more stable now?
This is not yet released as of 2014.1.7. We just installed today's develop branch from GitHub and will get back with results.
It seems it is not fixed; it's actually worse. The new code causes lots of stale ESTABLISHED connections on the master.
minion IP: 10.11.40.161, public: 176.227.142.126 saltmaster IP: 10.0.0.212, public: 54.246.180.52
minion:
root@debian:/usr/local/bin# netstat -ean |grep 54.246.180.52
tcp 0 0 10.11.40.161:43693 54.246.180.52:4505 ESTABLISHED 0 125808
root@debian:/usr/local/bin# ss -e|grep 54.246.180.52
ESTAB 0 0 10.11.40.161:43693 54.246.180.52:4505 timer:(keepalive,46sec,0) ino:125808 sk:f3512600
Master:
root@saltmaster:~ # netstat -ean |grep 176.227.142.126
tcp 0 0 10.0.0.212:4505 176.227.142.126:60215 ESTABLISHED 0 149365
tcp 0 0 10.0.0.212:4506 176.227.142.126:53544 ESTABLISHED 0 156632
tcp 0 0 10.0.0.212:4505 176.227.142.126:47687 ESTABLISHED 0 149367
tcp 0 0 10.0.0.212:4505 176.227.142.126:40874 ESTABLISHED 0 149360
tcp 0 1560 10.0.0.212:4505 176.227.142.126:37470 ESTABLISHED 0 149378
tcp 0 0 10.0.0.212:4506 176.227.142.126:53513 ESTABLISHED 0 156639
tcp 0 0 10.0.0.212:4505 176.227.142.126:54876 ESTABLISHED 0 149377
tcp 0 1560 10.0.0.212:4505 176.227.142.126:43693 ESTABLISHED 0 150116
tcp 0 0 10.0.0.212:4505 176.227.142.126:55295 ESTABLISHED 0 149362
tcp 0 0 10.0.0.212:4505 176.227.142.126:39531 ESTABLISHED 0 149361
tcp 0 0 10.0.0.212:4505 176.227.142.126:48655 ESTABLISHED 0 149363
strace on the minion at that time (logs show nothing useful):
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 508535066}) = 0
[pid 14053] gettimeofday({1405065700, 720156}, NULL) = 0
[pid 14053] gettimeofday({1405065700, 720449}, NULL) = 0
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509479581}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509881187}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000
... and so on, forever.
It seems the minion shows 1 ESTABLISHED connection with the master, the master shows 9, and none of them actually works. What probably happened is that the minion tried to reconnect to the master, leaving all those phantom ESTABLISHED connections behind on the master. For some reason reconnection was unsuccessful, since the minion does not respond to salt commands from the master. Perhaps the master does not know which connection is the right one?
A suggestion: the master should also ping the minions on established connections, and close the stale ones.
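The master-side half of that suggestion could look something like the sketch below (hypothetical code, not Salt internals): track the last ping time per connection and report the ones that have gone silent, so they can be closed instead of lingering as phantom ESTABLISHED sockets.

```python
import time


class ConnectionTable:
    """Sketch of master-side liveness tracking (hypothetical, not Salt code).

    Record the last time each minion connection pinged; any connection
    silent for longer than `timeout` seconds is reported as stale.
    """

    def __init__(self, timeout=900):
        self.timeout = timeout
        self.last_seen = {}

    def ping(self, conn_id, now=None):
        """Note a ping from `conn_id` (`now` overridable for testing)."""
        self.last_seen[conn_id] = time.time() if now is None else now

    def stale(self, now=None):
        """Return the connections that have gone silent past the timeout."""
        now = time.time() if now is None else now
        return [c for c, t in self.last_seen.items() if now - t > self.timeout]
```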
Yeah, something was mucked up with that commit. I created a new fix that seems much more stable: https://github.com/saltstack/salt/pull/14064
I'll likely make only a small change to that before I give the go-ahead to merge. Testing is most welcome!
To test this patch you can do:
curl -o install_salt.sh -L https://bootstrap.saltstack.com
sudo sh install_salt.sh -g https://github.com/steverweber/salt.git git fix_restarts
I installed the version above; this version does not even connect to the master once, something's wrong.
Firstly, it starts 2 salt-minion processes. And test.ping never works. I include the debug logfile:
2014-07-14 10:07:57,527 [salt ][INFO ] Setting up the Salt Minion "battens-c1.insolar-plants.net"
2014-07-14 10:07:57,534 [salt.utils.process][DEBUG ] Created pidfile: /var/run/salt-minion.pid
2014-07-14 10:07:57,537 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion
2014-07-14 10:07:57,799 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:07:57,800 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:07:57,803 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:07:57,803 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:07:57,806 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:07:57,807 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:07:58,302 [salt.minion ][DEBUG ] Attempting to authenticate with the Salt Master at 54.246.180.52
2014-07-14 10:07:58,305 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:02,290 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-07-14 10:08:02,291 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:04,158 [salt.minion ][INFO ] Authentication with master successful!
2014-07-14 10:08:06,970 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-07-14 10:08:06,972 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:11,594 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:12,552 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion
2014-07-14 10:08:12,813 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:08:12,814 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:08:12,817 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:08:12,818 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:08:12,821 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:08:12,821 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:08:13,990 [salt.utils.schedule ][INFO ] Added new job __mine_interval to scheduler
2014-07-14 10:08:13,991 [salt.minion ][DEBUG ] I am battens-c1.insolar-plants.net and I am not supposed to start any proxies. (Likely not a problem)
2014-07-14 10:08:13,991 [salt.minion ][INFO ] Minion is starting as user 'root'
2014-07-14 10:08:13,992 [salt.minion ][DEBUG ] Minion 'battens-c1.insolar-plants.net' trying to tune in
2014-07-14 10:08:13,994 [salt.minion ][DEBUG ] Minion PUB socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,995 [salt.minion ][DEBUG ] Minion PULL socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,995 [salt.minion ][INFO ] Starting pub socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,996 [salt.minion ][INFO ] Starting pull socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,997 [salt.minion ][DEBUG ] Generated random reconnect delay between '1000ms' and '11000ms' (2767)
2014-07-14 10:08:13,998 [salt.minion ][DEBUG ] Setting zmq_reconnect_ivl to '2767ms'
2014-07-14 10:08:13,999 [salt.minion ][DEBUG ] Setting zmq_reconnect_ivl_max to '11000ms'
@steverweber does the master expect pings from the minions? If not, it will never forget the stale connections. I think the master must be aware of this "ping": if the master does not receive pings from a minion, it could close that minion's connections.
The above log looks like the minion connects to the master at 54.246.180.52.
Authentication with master successful!
The master does not "expect" pings but rather accepts them. The minion sends pings at the ping_interval, in minutes.
Here is an aggressive configuration I use on my minions for testing:
master: ddns.name.com
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
Yes, I know the minion thought it was successfully connected, but it wasn't. It seems just restarting the minion the way it is done now confuses the master somehow. I could not issue a single successful command to the minion with the above version, even after multiple restarts. Reinstalling the "stock" minion version fixed this behaviour. Please tell me how to help debug further.
I found an issue in daemon mode when running under a thread; however, I don't think this would cause the minion to not respond to the master.
I pushed a new fix to fix_restarts that cleans up some issues.
It is strange, though, that there were multiple salt-minion processes running from the beginning. Even after killing them and restarting, there were again multiple salt-minion processes. Perhaps this manifests when the network between minion and master is slow, as in our case.
You should see two processes.
Once some tricky issues are solved, this solution can be one process.
The current version is holding up well on my systems. However I'm holding off my pull request until this solution works on your environment. Are you testing the latest version https://github.com/steverweber/salt/tree/fix_restarts that was pushed 2 days ago? Is it working out?
I will test tomorrow.
When testing this patch please disable your custom tcp_keepalive_* settings and reboot the system.
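For reference, the settings in question are the minion-config TCP keepalive options, something like the fragment below (values illustrative; verify the option names against your Salt version's minion config reference):

```
# minion config -- the tcp_keepalive_* tuning to disable while testing
tcp_keepalive: True
tcp_keepalive_idle: 300       # seconds of idle before probes start
tcp_keepalive_cnt: 3          # failed probes before the kernel gives up
tcp_keepalive_intvl: 60       # seconds between probes
```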
Thanks.
The keepalive patch has been merged to the develop branch.
@steverweber sorry for the long delay, I'm ready to test again. Where could I find your latest code to test?
Ignore last comment, I'm testing latest dev branch.
How are the little minions behaving?
I ran it on a minion that normally gets lost within a few hours, and with the dev version it has been responding to occasional pings for the past 8 days, sometimes on the 2nd ping. I would say that's very good news. I'll install it on 2-3 more minions soon. Thanks.
@sivann can this issue be closed ?
Will your patch get released? If yes, then yes, I consider it fixed. My minion still responds :-)
I think it's currently only in the develop branch, so that would make it set for the feature release after 2014.7.
Seeing similar issues to this. Is there a plan for this to make it into a release?
The patch /works/ but it's not elegant. Personally I would rather see the minion die hard and have the service manager (systemd, upstart, whatever you have) restart the minion. https://github.com/saltstack/salt/pull/22313
Does the keepalive patch (https://github.com/saltstack/salt/issues/12540#issuecomment-50223513) simply restart the minion? This was the patch I referred to.
It restarts the minion... but it's a minion restarting itself. You will see two minion processes in ps:
one that keeps the other one running. It was done this way because salt code was not really built for rebuilding the minion object in the same process (arg parsing and global objects are tricky).
Looking back at the code, it would be simpler to update all the different service launchers (systemd, upstart, init.d, launchd...) to auto-restart if the minion dies. https://github.com/saltstack/salt/pull/22313
Surely the better approach would be to resolve the reason a restart is needed in the first place (the minion stops communicating with the master). init.d, for example, has no auto-restart ability and would need something like supervisord.
I agree, exiting the minion and relying on systemd/init/monit is also another source of technical issues: systemd timeouts, init muting the service, etc. Salt minion should be robust enough to cope with a simple network reconnection.
Well, this is not completely fixed, although the ping does seem to work. I have:
ping_interval: 90
auth_tries: 20
rejected_retry: True
auth_safemode: False
restart_on_error: True
All commands always fail at first, and some on several subsequent tries. Not very reliable if you have thousands of minions. Looking forward to RAET in order to actually benefit from saltstack, because in its current state we can only use it for first-time configurations/installations.
You might also try ping_on_rotate: True in your master config, so that it will automatically send a test.ping job after an AES key rotate. That solves some of the "slow to respond" issues for some users.
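In the master config that is a one-line addition:

```
# master config
ping_on_rotate: True
```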
I'm in a situation where VPN connections come and go, sometimes changing the IP address of the endpoint.
I think I can cope with this with the changes to the OpenVPN config that would do a minion restart on the VPN coming up, but of course I'm similarly interested in this open issue.
Also looking forward to RAET, because losing connections to minions is a painful experience.
@basepi thanks, I will try that.
Hello,
The problem:
We now have about 100 salt-minions installed in remote areas over 3G and satellite connections.
We lose connectivity with all of those minions within about 1-2 days of installation, with test.ping reporting "minion did not return". Each time, the minions saw an ESTABLISHED TCP connection, while on the salt-master there were no connections listed at all. (Yes, that is correct.) Tighter keepalive settings were tried with no result. (OS is Linux.) Each time, restarting the salt-minion fixes the problem immediately.
Obviously the connections are transparently proxied someplace (who knows what happens on those SAT networks), so the whole TCP-keepalive mechanism of 0mq fails.
Suggestion:
Salt should handle this at the application level, so as to determine connection health and reconnect if needed, e.g. by sending dummy ping data periodically (every 10 minutes, say) and checking for a valid reply. The only workaround we can see is restarting salt-minion hourly, which is really ugly.