[BUG] 3003.1 tcp_keepalive probes not sent to idle minion if any connection is not idle

gbunt commented 3 years ago

Description TCP keepalive probes are sent to a minion if a zmq connection is idle for tcp_keepalive_idle time. This works as designed if there's no activity at all on the master and salt/presence/change events are seen on the event bus after tcp_keepalive_cnt * tcp_keepalive_intvl time a minion node does not return an ack packet.

However, if there's any activity on the zmq socket between master and any minion, even if that's not the minion that became unreachable, keepalive probes stop beings sent, and a node down is not being detected by the presence system. Also, while that node remains unreachable but the activity on the zmq socket stopped, no new keepalive probes are sent to the unreachable node, resulting in a missed presence/change event and the node never being added to manage.not_alived.

Setup Linux tight-tiger-node1 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Relevant master config:

tcp_keepalive: True
tcp_keepalive_idle: 10
tcp_keepalive_cnt: 3
tcp_keepalive_intvl: 10

Steps to Reproduce the behavior

NOTE: 10.43.128.111 is master here, 10.43.128.114 one of the minions

Run a tcpdump capturing keepalive packets on master, eg:

$ tcpdump -ni any "host 10.43.128.114 and (host 10.43.128.111 and tcp port 4505) and ( tcp[tcpflags] == tcp-ack and len == 52 )"
13:15:40.902991 IP 10.43.128.111.4505 > 10.43.128.114.60228: Flags [.], ack 1, win 128, options [nop,nop,TS val 346993721 ecr 2634232174], length 0
13:15:40.903282 IP 10.43.128.114.60228 > 10.43.128.111.4505: Flags [.], ack 7184, win 348, options [nop,nop,TS val 2634242418 ecr 346809421], length 0

On the master, issue any module.function in a loop targeting any of the other minions, eg:
```
$ while true; do salt -G 'ipv4:10.43.128.111' cmd.run hostname; sleep 5; done
```

tcpdump now shows that keepalive probes aren't sent anymore from master to minion, we only see the ack packets from minion to master:

14:13:23.784085 IP 10.43.128.114.60228 > 10.43.128.111.4505: Flags [.], ack 8371, win 348, options [nop,nop,TS val 2637705299 ecr 350456602], length 0
14:13:26.054971 IP 10.43.128.114.60228 > 10.43.128.111.4505: Flags [.], ack 8836, win 348, options [nop,nop,TS val 2637707569 ecr 350458872], length 0
14:13:28.387538 IP 10.43.128.114.60228 > 10.43.128.111.4505: Flags [.], ack 9301, win 348, options [nop,nop,TS val 2637709902 ecr 350461205], length 0
14:13:30.619855 IP 10.43.128.114.60228 > 10.43.128.111.4505: Flags [.], ack 9766, win 348, options [nop,nop,TS val 2637712134 ecr 350463437], length 0
14:13:32.970185 IP 10.43.128.114.60228 > 10.43.128.111.4505: Flags [.], ack 10231, win 348, options [nop,nop,TS val 2637714485 ecr 350465788], length 0

Take down network on minion (10.43.128.114 in this case)
No salt/presence/change events are seen on the event bus
Abort the cmd.run loop on master. No salt/presence/change will be seen. Outage of a node has now gone unnoticed.

Expected behavior Keepalive probes are sent on connections idle time exceeding tcp_keepalive_idle, regardless of other minion zmq activity.

Screenshots NA

Versions Report

salt --versions-report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ``` Salt Version: Salt: 3002.2 Dependency Versions: cffi: Not Installed cherrypy: Not Installed dateutil: Not Installed docker-py: Not Installed gitdb: Not Installed gitpython: Not Installed Jinja2: 2.8.1 libgit2: Not Installed M2Crypto: 0.33.0 Mako: Not Installed msgpack: 0.6.2 msgpack-pure: Not Installed mysql-python: Not Installed pycparser: Not Installed pycrypto: Not Installed pycryptodome: Not Installed pygit2: Not Installed Python: 3.6.8 (default, Nov 16 2020, 16:55:22) python-gnupg: Not Installed PyYAML: 3.11 PyZMQ: 17.0.0 smmap: Not Installed timelib: Not Installed Tornado: 4.5.3 ZMQ: 4.1.4 System Versions: dist: centos 7 Core locale: UTF-8 machine: x86_64 release: 5.4.0-52-generic system: Linux version: CentOS Linux 7 Core ```

Additional context On a side-note, it seems the presence system is pretty disconnected from the rest and completely dependent on tcp_keepalive. In case a presence event isn't picked up (yet), i would expect if salt "actively" finds a node is unresponsive, it would issue a salt/presence/change event as well. Example:

Node goes down
Before tcp_keepalive timeout is exceeded one manually runs a salt-run manage.down (or anything else that will detect a minion is not responding)
Master knows the node is not responding -> issue a salt/presence/change event

welcome[bot] commented 3 years ago

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey. Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar. If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!

sagetherage commented 3 years ago

@gbunt The reported Salt version is vulnerable. We recommend upgrading as soon as possible to at least the latest point release of v3002: https://github.com/saltstack/salt/releases/tag/v3002.6 or to latest. You can find those on releases here in GitHub or the package repo. Likely your issue will persist, but if you are able to upgrade can you please confirm the issue is still seen, please?

gbunt commented 3 years ago

thanks @sagetherage, these are disposable dev environments, and we'll surely provision latest when going for a production release. For now i've upgraded master and all minions to:

Salt Version:
          Salt: 3003.1

The behavior is indeed also reproducible on this version

waynew commented 3 years ago

@gbunt thanks for opening this issue! I'll be taking a look to try and reproduce this locally, looks like it should be pretty straightforward...

waynew commented 3 years ago

I'm going to just say that the manage.alived process is flat out broken.

Spun up a salt master and two minions with docker compose. Used docker network disconnect <network> <minion2>.

Now on the salt master, run manage.alived... and it showed that the disconnected minion was alive.

Tried test.ping. Tried ping -c1 -W5 <minion_container_ip> on the salt-master container and they appropriately failed.

And yet... manage.alived still says it's up. What's even more hilariously awful - manage.down shows the downed minion, but manage.alived does, too. :disappointed:

This is seriously buggy.

gbunt commented 3 years ago

@waynew have you tested with the same tcp_keepalive master settings? And have you waited until tcp_keepalive_cnt * tcp_keepalive_intvl exceeded? In any case, it does seem like a different issue as to what we're seeing, as in general the presence system does do it's job in our environment, but only as long as there's no zmq activity at all, from any minion. If there is, minions dropping off will go unnoticed.

gbunt commented 1 month ago

Any news on this? We're still seeing this all over with v3006.9, minions that respond to a test.ping but are not listed under manage.alived and vice-versa. Now also combining with master failover and failback it seems to be completely unreliable, after a failover the master is often seen only listing one minion in manage.alived while 6 are connected. In our current setup we look at whether a majority of minions are connected to master as seen from the presence system, before we start running certain jobs or orchestrations, so that's completely unreliable now.

Any hints here, is there a way to get the presence system in sync somehow?

saltstack / salt

[BUG] 3003.1 tcp_keepalive probes not sent to idle minion if any connection is not idle #60428