Closed deuscapturus closed 10 years ago
Hmm, that's curious. Could you please provide us with a --versions-report
and we'll investigate this. Thanks.
Hi @cachedout,
I was able to get the false positive matches to stop by running mine.update a few times. The problem with all of the minions not returning is due to my multi-master setup. This is the same issue reported at https://github.com/saltstack/salt/issues/7697. Let's close this issue and continue the discussion in 7697.
Thanks
```
[root@ip-10-1-0-116 master]# salt --versions-report
Salt: 2014.1.10
Python: 2.6.9 (unknown, Mar 28 2014, 00:06:37)
Jinja2: 2.7.2
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.6.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.4
```
Ahhh, multi-master. Should have guessed. :]
Sounds good. We'll continue our discussion there. Thanks much!
Again I'm having false matches with mine.get. I think I've ruled out this being a multi-master problem. I fixed it again by manually deleting all of the minion cache (`salt-run cache.clearmine` doesn't work), clearing the minion mine cache with `salt '*' mine.flush`, and updating with `salt '*' mine.update`. I'll try to figure out why this is happening and report it here.
Thanks for the update @deuscapturus. Guess we've got more work to do.
@cachedout and @basepi I found the problem. At some point over the weekend ALL of my minions with a mine_functions defined in the pillar started falsely matching with mine.get again.
The command below returned a match for every minion with mine_functions defined:

```
salt-call mine.get 'fakegrain:totallyfakevalue' 'network.interfaces' 'grain'
```
Here is the mine definition in the pillar for each minion:

```yaml
mine_functions:
  network.interfaces: []
  test.ping: []
  network.ip_addrs: [eth0]
```
All minions reporting false positives are missing /var/cache/salt/master/minions/{minion-id}/data.p. Once the data.p file is regenerated, the minion no longer falsely matches.
I regenerated the cache file with `salt '*' saltutil.sync_all`.
Thanks
When data.p is missing from the master cache, the variable `match` comes back empty; see https://github.com/saltstack/salt/blob/af504f12a485979312c4e7310c6efacffdc5489d/salt/utils/__init__.py#L1228

```python
match = traverse_dict_and_list(data, key, {}, delimiter=delimiter)
```

If `match` is the empty default `{}`, the `_match()` function correctly returns False. No problem here.
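To make that concrete, here is a minimal sketch of a traverse-style lookup (my own simplification, not Salt's actual implementation) showing why a missing data.p yields the falsy `{}` default:

```python
# Minimal sketch (not Salt's actual code) of the lookup described above:
# walk a nested dict using a delimited key, returning a default as soon
# as any path component is missing -- as happens when data.p is absent.
def traverse_dict_and_list(data, key, default=None, delimiter=':'):
    ptr = data
    for part in key.split(delimiter):
        if isinstance(ptr, dict) and part in ptr:
            ptr = ptr[part]
        else:
            return default
    return ptr

# With no cached grains (data.p missing), the lookup yields the {} default,
# which is falsy, so a grain comparison correctly evaluates to False here.
cached_grains = {}
match = traverse_dict_and_list(cached_grains, 'fakegrain:totallyfakevalue', {})
print(match)        # -> {}
print(bool(match))  # -> False
```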
I suspect the problem is within https://github.com/saltstack/salt/blob/2fc36a31a08c90386538acdf39a1ef357fc63f72/salt/daemons/masterapi.py#L482
Sadly I don't have the resources to debug the daemon.
We are seeing the same thing here, on 2014.1.10. We have mine_functions set as well, and we have a single master. One symptom when this starts happening is that `salt-call mine.get 'roles:reg' network.get_hostname grain` and `salt -G 'roles:reg' mine.get network.get_hostname` return two different lists. We are also on GCE, and we may see general network disruptions around the same time that the minions go bonkers.
Wow, thanks for the great updates, @deuscapturus, they will definitely be useful. I'm really not sure why the minion caches are disappearing at times, we'll definitely investigate this.
68% of my minions are missing data.p after 24 hours. I had debug logging turned on on the master during that 24-hour period and found nothing to explain this. Some of the minions with deleted data.p are configured with multi-master and some with a single master. I ran `salt '*' saltutil.sync_all` to fix it again.
This is a serious problem. I hope it's getting the attention it deserves.
@iggy, could you run `ls /var/cache/salt/master/minions/*` on your master and tell me if any minions are missing the data.p cache file?
Not presently, but the next time we notice issues, I will make sure I (or one of the other guys) check.
Hrm, I don't like this at all. Upgrading to high severity.
Is salt-ssh in the mix here anywhere? In researching this issue, I discovered a bug wherein the use of salt-ssh may cause this exact issue.
I've never used salt-ssh.
OK, thanks. We'll keep digging.
We don't use salt-ssh either.
I know that the minion data cache is cleaned out when keys are deleted, so I am looking there, among other places.
@deuscapturus @iggy: I think it is in the check_minion_cache method of the Key class in salt/key.py. This cleans out the minion data caches for minions whose keys have been deleted. It would make sense that this is the issue if you are seeing it after deleting keys. I believe this is the only place this check happens. I am still trying to reproduce more of this.
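For readers following along, here is an illustrative sketch of what a check_minion_cache-style cleanup does; the paths and logic are assumptions based on this discussion, not the actual salt/key.py code:

```python
# Illustrative sketch (not the real salt/key.py): remove the master-side
# data cache for any minion whose accepted key no longer exists.
import os
import shutil

def check_minion_cache(pki_dir='/etc/salt/pki/master/minions',
                       cache_dir='/var/cache/salt/master/minions'):
    """Delete cached minion data (e.g. data.p) for minions whose
    accepted key has been removed from the PKI dir."""
    accepted = set(os.listdir(pki_dir))       # accepted minion keys
    for minion_id in os.listdir(cache_dir):   # one cache dir per minion
        if minion_id not in accepted:
            shutil.rmtree(os.path.join(cache_dir, minion_id))
```

This is why a nightly `manage.down removekeys=True` (mentioned later in this thread) can silently wipe caches for minions that were merely idle.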
Can you share your master config files with us please @deuscapturus @iggy ?
@deuscapturus @iggy this PR, which I merged into 2014.7, gates the call that cleans up the data cache when keys are deleted. Can you please apply this patch to your master (or gate the config-gated section in salt/key.py yourself) and let us know if it helps resolve the problem? https://github.com/saltstack/salt/pull/16358
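For reference, later Salt releases expose this gate as a master config option; to the best of my recollection the option is `preserve_minion_cache` (worth verifying against the PR and the docs for your version):

```yaml
# Master config (option name per current Salt docs; verify for your version):
# keep the master-side minion data cache even after a minion's key is deleted
preserve_minion_cache: True
```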
If I'm understanding this correctly, this doesn't actually fix what's wrong. It just wallpapers over it. I'm not even sure if it can be reasonably fixed. I'm guessing this ties into some other issues we've seen in the past. i.e. things like "salt '*' test.ping" only returning on some instances, having to run highstate a few times in a row, etc. I wonder if RAET might help us mitigate some of this, or if it's just something we'll have to live with in GCE.
This patch is primarily intended to help us narrow down what the underlying cause of this problem is. Your explanation in this post also leads me to believe that your issues are more focused on minions just not getting commands and less from the minion cache data issues. All in all this issue just requires some deeper digging to find what the real problem is.
Note, I am running this cron job nightly:
```
1 0 * * * salt-run manage.down removekeys=True
```
I'll look into the patch next
@deuscapturus Any news on a test with the patch? (I know it was the weekend, but I'm just making my rounds on the issue tracker so I figured I'd ask.) Thanks!
Fwiw, we applied this patch on Friday and haven't seen any issues since then. That said, I think it's too early to say conclusively (at least for us, we could go days without seeing the issue, then bam).
I've applied the patch to my master running the 2014.1.10-4 release. I'll know if it worked over the next couple days.
I've also noticed that (in AWS) `salt-run manage.down removekeys=True` will delete keys from minions if they have been idle for several hours. I use `auto_accept: True` in my master config, so my minions reconnect without any issue, but I do believe they do not regenerate their cache until I manually run a sync.
I think the fix will solve this for me. But we still need to fix the false positive match from `mine.get` when the minion cache is missing.
Thank you all for your urgency in resolving this issue.
@deuscapturus Thanks for the update!
OK, here is what I think the problem boils down to. I think this issue is actually in the matcher, and well outside of the mine code itself.
I think that what you're seeing is a condition in which the grains matcher is getting confused under the following conditions:
1. A minion key exists in the PKI dir
2. The minion data cache is enabled
3. No data.p cache has yet been generated
In that case, the master doesn't yet know what the minion's grains are, but it assumes it will get them at some point in the future. Therefore it matches the grain even though it has no actual proof of its existence. Normally this extra match isn't much of a problem, because additional verification is performed on the minion, and all that really happens on the master end is that it may wait a little longer for a minion that won't end up replying.
However, in the case of the mine, things are different since the match will end up being used blindly by the master to return mine data.
I think what we may need to do here is to add some flags to the matchers that can be used by the mine (or any other master-only) operations that tell them to be more conservative in their matching.
Before we do, however, I wanted to run this past you and see if this reasoning matches (so to speak!) your experience. :]
UPDATE: Here's my area of concern: https://github.com/saltstack/salt/blob/2014.1/salt/utils/minions.py#L175
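A hedged sketch of the "more conservative matching" idea described above; the function and flag names (`check_cache_match`, `greedy`) are illustrative, not Salt's actual API:

```python
# Hypothetical sketch of conservative vs. greedy cache-based matching.
# Names here are illustrative, not Salt's real API.
def check_cache_match(minion_id, expr, cache, greedy=True):
    """Decide whether a minion matches a 'key:value' grain expression
    using only the master-side cache.

    greedy=True  -> a minion with no cached grains is assumed to match
                    (tolerable for targeting, since minions re-verify).
    greedy=False -> a minion with no cached grains is excluded
                    (safer for master-only operations like the mine).
    """
    grains = cache.get(minion_id)
    if grains is None:          # no data.p for this minion
        return greedy
    key, _, val = expr.partition(':')
    return str(grains.get(key)) == val

cache = {'web1': {'roles': 'reg'}, 'db1': None}  # db1 has no cached data
print(check_cache_match('db1', 'roles:reg', cache, greedy=True))   # -> True
print(check_cache_match('db1', 'roles:reg', cache, greedy=False))  # -> False
```

With `greedy=False`, a minion missing data.p could never be blindly included in mine results, which is exactly the false-positive case reported here.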
I agree, this sounds like the right approach.
@deuscapturus I think it is unlikely that this will make it into the .1 branch with 2014.7 imminent but if you want to backport it to test your own installation, it should be fairly trivial. Let me know if you have questions.
Thanks all.
Thanks for the reply, @deuscapturus. I'll go ahead and close this.
This problem is back. I recently updated to 2015.5.5:
```
Salt: 2015.5.5
Python: 2.6.9 (unknown, Apr 1 2015, 18:16:00)
Jinja2: 2.7.3
M2Crypto: 0.21.1
msgpack-python: 0.4.6
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.11
ioflo: Not Installed
PyZMQ: 14.3.1
RAET: Not Installed
ZMQ: 3.2.5
Mako: Not Installed
Tornado: Not Installed
timelib: Not Installed
dateutil: 2.1
```
```
# salt 'haproxy-staging' mine.get 'G@no:such and G@grains:exist' network.ip_addrs compound
haproxy-staging:
    ----------
    ps-dev-paring:
        - 192.168.4.18
```
@deuscapturus Because this issue has been closed for a year, would you mind opening a new one and referring back to this one? Thanks!
Using mine.get doesn't return data for all of my matches and always returns data from the same 3 minions no matter what my match request is. I've already cleared the mine cache on the minions and master.
- First example: shows the minions set with the grain roles:elasticsearch
- Second example: shows that only one of the two matching minions returned, along with 3 other non-matching minions
- Third example: shows the 3 minions always returning as a match for any request
(Screenshots of the first, second, and third examples were attached here.)