Open · gbunt opened this issue 3 years ago
Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey. Please be sure to review our Code of Conduct. Also, check out some of our community resources including:
There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar. If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!
@gbunt The reported Salt version is vulnerable. We recommend upgrading as soon as possible to at least the latest point release of v3002: https://github.com/saltstack/salt/releases/tag/v3002.6, or to the latest release. You can find those under Releases here on GitHub or in the package repo. Your issue will likely persist, but if you are able to upgrade, can you please confirm the issue is still seen?
Thanks @sagetherage, these are disposable dev environments, and we'll surely provision the latest release when going for production. For now I've upgraded the master and all minions to:
```
Salt Version:
          Salt: 3003.1
```
The behavior is indeed also reproducible on this version
@gbunt thanks for opening this issue! I'll be taking a look to try and reproduce this locally, looks like it should be pretty straightforward...
I'm going to just say that the `manage.alived` process is flat out broken.
Spun up a salt master and two minions with docker compose. Used `docker network disconnect <network> <minion2>`.
Now on the salt master, run `manage.alived`... and it showed that the disconnected minion was alive.
Tried `test.ping`. Tried `ping -c1 -W5 <minion_container_ip>` on the salt-master container and they appropriately failed.
And yet... `manage.alived` still says it's up. What's even more hilariously awful - `manage.down` shows the downed minion, but `manage.alived` does, too. :disappointed:
This is seriously buggy.
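For what it's worth, here is a rough shell transcript of the repro described above; the compose service names (`master`, `minion1`, `minion2`), the network name `saltnet` and the container IP placeholder are illustrative, not taken from the original setup:

```sh
# Bring up one master and two minions (compose file not shown here).
docker compose up -d master minion1 minion2

# Cut minion2 off from the network.
docker network disconnect saltnet minion2

# Presence still lists the disconnected minion as alive...
docker compose exec master salt-run manage.alived

# ...even though it no longer answers at either layer.
docker compose exec master salt 'minion2' test.ping
docker compose exec master ping -c1 -W5 <minion2_container_ip>

# manage.down correctly reports minion2, yet manage.alived keeps listing it.
docker compose exec master salt-run manage.down
```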
@waynew have you tested with the same `tcp_keepalive` master settings? And have you waited until `tcp_keepalive_cnt` * `tcp_keepalive_intvl` has been exceeded?
In any case, it does seem like a different issue from what we're seeing: in general the presence system does do its job in our environment, but only as long as there's no zmq activity at all, from any minion. If there is, minions dropping off go unnoticed.
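For reference, a minimal sketch of the master-side keepalive knobs being discussed; the values are purely illustrative (not this environment's settings), and the restart command assumes a systemd-managed master:

```sh
# Illustrative values only; append to /etc/salt/master and restart the master.
# With 60s idle + 5 probes * 10s, a vanished minion should normally be flagged
# roughly 60 + 5 * 10 = 110 seconds after its connection goes silent.
cat >> /etc/salt/master <<'EOF'
tcp_keepalive: True
tcp_keepalive_idle: 60
tcp_keepalive_intvl: 10
tcp_keepalive_cnt: 5
EOF
systemctl restart salt-master
```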
Any news on this? We're still seeing this all over with v3006.9: minions that respond to a `test.ping` but are not listed under `manage.alived`, and vice versa. Combined with master failover and failback it seems to be completely unreliable; after a failover the master is often seen listing only one minion in `manage.alived` while 6 are connected. In our current setup we check whether a majority of minions are connected to the master, as seen from the presence system, before we start running certain jobs or orchestrations, so that check is completely unreliable now.
Any hints here? Is there a way to get the presence system in sync somehow?
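Not an answer to the sync question, but one way to quantify the mismatch is to compare the presence view with an actual ping sweep; all of the commands below are standard Salt runners/modules:

```sh
# What the presence system thinks is connected.
salt-run manage.alived

# What actually answers right now (does not rely on presence at all).
salt '*' test.ping --timeout=10 --out=txt

# Minions with accepted keys that did not respond.
salt-run manage.down
```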
Description
TCP keepalive probes are sent to a minion if a zmq connection is idle for `tcp_keepalive_idle` time. This works as designed as long as there's no activity at all on the master: `salt/presence/change` events are seen on the event bus after `tcp_keepalive_cnt` * `tcp_keepalive_intvl` time once a minion node stops returning ack packets.
However, if there's any activity on the zmq socket between the master and any minion, even if that's not the minion that became unreachable, keepalive probes stop being sent and the node going down is not detected by the presence system. Also, once that node remains unreachable but the activity on the zmq socket has stopped, no new keepalive probes are sent to the unreachable node, resulting in a missed `salt/presence/change` event and the node never being added to `manage.not_alived`.

Setup
Linux tight-tiger-node1 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Steps to Reproduce the behavior

1. Run `tcpdump` on the master, capturing keepalive packets between the master and a minion.
2. Make that minion unreachable while the master is otherwise idle: the keepalive probes go unanswered, and after `tcp_keepalive_cnt` * `tcp_keepalive_intvl`, `salt/presence/change` events are seen on the event bus.
3. Repeat while running a `cmd.run` loop on the master. `tcpdump` now shows that keepalive probes aren't sent anymore from master to minion, we only see the ack packets from minion to master. No `salt/presence/change` will be seen. The outage of the node has now gone unnoticed.
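To make step 1 concrete, a capture along these lines can be used to watch the probes; the interface, the publish port (4505) and the minion address are assumptions, and keepalive probes show up as zero-length ACKs repeated every `tcp_keepalive_intvl` seconds:

```sh
# Watch TCP keepalive traffic between the master and one minion on the zmq
# publish port. Probes from the master appear as "length 0" ACK packets; a
# reachable minion answers each one with a bare ACK.
tcpdump -nn -i any "tcp port 4505 and host 192.0.2.10"
```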
Expected behavior
Keepalive probes are sent on connections whose idle time exceeds `tcp_keepalive_idle`, regardless of zmq activity from other minions.

Screenshots
NA
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
```
Salt Version:
          Salt: 3002.2

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.8.1
       libgit2: Not Installed
      M2Crypto: 0.33.0
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
  python-gnupg: Not Installed
        PyYAML: 3.11
         PyZMQ: 17.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: centos 7 Core
        locale: UTF-8
       machine: x86_64
       release: 5.4.0-52-generic
        system: Linux
       version: CentOS Linux 7 Core
```

Additional context
On a side note, it seems the presence system is pretty disconnected from the rest and completely dependent on tcp_keepalive. In case a presence event isn't picked up (yet), I would expect that if Salt "actively" finds a node is unresponsive, it would issue a
`salt/presence/change` event as well. Example:

1. `salt-run manage.down` (or anything else that will detect a minion is not responding)
2. A `salt/presence/change` event is issued.
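As a crude stop-gap for the behaviour wished for above, something like the following could poll `manage.down` and push a custom event onto the bus whenever it reports minions; the event tag is made up, and `salt-call event.send` assumes a minion is also installed on the master node:

```sh
# Rough polling loop: whenever manage.down reports minions, fire a custom
# event so reactors/orchestration can react. 'custom/presence/manual' is a
# made-up tag, not an official Salt event.
while true; do
    down="$(salt-run --out=json manage.down)"
    if [ "$down" != "[]" ]; then
        salt-call event.send 'custom/presence/manual' "{\"down\": $down}"
    fi
    sleep 60
done
```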