saltstack / salt


mine.get doesn't return minion data as requested in matches #15673

Closed: deuscapturus closed this issue 10 years ago

deuscapturus commented 10 years ago

Using mine.get doesn't return data for all of my matching minions, and it always returns data from the same three minions no matter what my match request is. I've already cleared the mine cache on the minions and the master.

First example: shows the two minions that have the grain roles:elasticsearch.
Second example: shows that only one of the two matching minions is returned, along with three other non-matching minions.
Third example: shows the same three minions always returned as a match for any request.

First

[root@ip-10-1-0-116 master]# salt -G 'roles:elasticsearch' test.ping
ip-10-1-14-103.ec2.internal:
    True
ip-10-1-14-160.ec2.internal:
    True

Second

[root@ip-10-1-0-116 master]# salt 'ip-10-1-14-103.ec2.internal' mine.get 'roles:elasticsearch' 'network.interfaces' grain
ip-10-1-14-103.ec2.internal:
    ----------
    ip-10-1-14-103.ec2.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:52:f3:f2:e6:8f
            inet:
                ----------
                - address:
                    10.1.14.103
                - broadcast:
                    10.1.14.127
                - label:
                    eth0
                - netmask:
                    255.255.255.192
            inet6:
                ----------
                - address:
                    fe80::852:f3ff:fef2:e68f
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True
    ip-172-31-56-19.us-west-2.compute.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:07:06:d0:a6:ce
            inet:
                ----------
                - address:
                    172.31.56.19
                - broadcast:
                    172.31.63.255
                - label:
                    eth0
                - netmask:
                    255.255.240.0
            inet6:
                ----------
                - address:
                    fe80::807:6ff:fed0:a6ce
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True
    ip-172-31-60-160.us-west-2.compute.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:dc:3c:1d:03:cb
            inet:
                ----------
                - address:
                    172.31.60.160
                - broadcast:
                    172.31.63.255
                - label:
                    eth0
                - netmask:
                    255.255.240.0
            inet6:
                ----------
                - address:
                    fe80::8dc:3cff:fe1d:3cb
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True
    ip-172-31-62-70.us-west-2.compute.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:70:43:30:a9:b2
            inet:
                ----------
                - address:
                    172.31.62.70
                - broadcast:
                    172.31.63.255
                - label:
                    eth0
                - netmask:
                    255.255.240.0
            inet6:
                ----------
                - address:
                    fe80::870:43ff:fe30:a9b2
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True

Third

[root@ip-10-1-0-116 master]# salt 'ip-10-1-14-103.ec2.internal' mine.get 'roles:asdfasdfasdf' 'network.interfaces' grain
ip-10-1-14-103.ec2.internal:
    ----------
    ip-172-31-56-19.us-west-2.compute.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:07:06:d0:a6:ce
            inet:
                ----------
                - address:
                    172.31.56.19
                - broadcast:
                    172.31.63.255
                - label:
                    eth0
                - netmask:
                    255.255.240.0
            inet6:
                ----------
                - address:
                    fe80::807:6ff:fed0:a6ce
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True
    ip-172-31-60-160.us-west-2.compute.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:dc:3c:1d:03:cb
            inet:
                ----------
                - address:
                    172.31.60.160
                - broadcast:
                    172.31.63.255
                - label:
                    eth0
                - netmask:
                    255.255.240.0
            inet6:
                ----------
                - address:
                    fe80::8dc:3cff:fe1d:3cb
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True
    ip-172-31-62-70.us-west-2.compute.internal:
        ----------
        eth0:
            ----------
            hwaddr:
                0a:70:43:30:a9:b2
            inet:
                ----------
                - address:
                    172.31.62.70
                - broadcast:
                    172.31.63.255
                - label:
                    eth0
                - netmask:
                    255.255.240.0
            inet6:
                ----------
                - address:
                    fe80::870:43ff:fe30:a9b2
                - prefixlen:
                    64
            up:
                True
        lo:
            ----------
            hwaddr:
                00:00:00:00:00:00
            inet:
                ----------
                - address:
                    127.0.0.1
                - broadcast:
                    None
                - label:
                    lo
                - netmask:
                    255.0.0.0
            inet6:
                ----------
                - address:
                    ::1
                - prefixlen:
                    128
            up:
                True
cachedout commented 10 years ago

Hmm, that's curious. Could you please provide us with a --versions-report and we'll investigate this. Thanks.

deuscapturus commented 10 years ago

Hi @cachedout,

I was able to get the false positive matches to stop by running mine.update a few times. The problem with not all of the minions returning is due to my multi-master setup. This is the same issue reported in https://github.com/saltstack/salt/issues/7697. Let's close this issue and work from #7697.

Thanks

[root@ip-10-1-0-116 master]# salt --versions-report
           Salt: 2014.1.10
         Python: 2.6.9 (unknown, Mar 28 2014, 00:06:37)
         Jinja2: 2.7.2
       M2Crypto: 0.20.2
 msgpack-python: 0.1.13
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.10
          PyZMQ: 2.2.0.1
            ZMQ: 3.2.4

cachedout commented 10 years ago

Ahhh, multi-master. Should have guessed. :]

Sounds good. We'll continue our discussion there. Thanks much!

deuscapturus commented 10 years ago

Again I'm having false matches with mine.get. I think I've ruled out this being a multi-master problem. I fixed it again by manually deleting all of the minion cache on the master (salt-run cache.clearmine doesn't work), flushing the minion mine cache with salt '*' mine.flush, and then running salt '*' mine.update. I'll try to figure out why this is happening and report it here.
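
For reference, here is a minimal sketch of the same flush-and-update sequence using Salt's Python client API instead of the CLI; it assumes it is run as root on the master with a default configuration.

import salt.client

# Connect to the local master (reads the default master config).
client = salt.client.LocalClient()

# Clear each minion's mine cache, then repopulate it from mine_functions.
client.cmd('*', 'mine.flush')
client.cmd('*', 'mine.update')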

basepi commented 10 years ago

Thanks for the update @deuscapturus. Guess we've got more work to do.

deuscapturus commented 10 years ago

@cachedout and @basepi I found the problem. At some point over the weekend, ALL of my minions with mine_functions defined in pillar started falsely matching with mine.get again.

The command below returned a match for every minion with mine_functions defined:

salt-call mine.get 'fakegrain:totallyfakevalue' 'network.interfaces' 'grain'

Here is the mine definition in the pillar for each minion

mine_functions:
  network.interfaces: []
  test.ping: []
  network.ip_addrs: [eth0]

All minions reporting false positives are missing /var/cache/salt/master/minions/{minion-id}/data.p. Once the data.p file is regenerated, the minion no longer falsely matches.

I regenerated the cache file with salt '*' saltutil.sync_all
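
For anyone who wants to check their own master, a small hypothetical helper along these lines (paths assume a default master install; the script is illustrative, not part of Salt) will list minions that are missing data.p:

import os

CACHE_DIR = '/var/cache/salt/master/minions'

# Report every cached minion directory that has no serialized data.p file.
for minion_id in sorted(os.listdir(CACHE_DIR)):
    if not os.path.isfile(os.path.join(CACHE_DIR, minion_id, 'data.p')):
        print('missing data.p: {0}'.format(minion_id))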

  1. Why was data.p missing?
  2. Why would the mine falsely match a minion when data.p is missing?

Thanks

deuscapturus commented 10 years ago

When data.p is missing from the master cache, the "match" variable comes back empty. See below: https://github.com/saltstack/salt/blob/af504f12a485979312c4e7310c6efacffdc5489d/salt/utils/__init__.py#L1228

 match = traverse_dict_and_list(data, key, {}, delimiter=delimiter)

If match is empty ({}), the _match() function correctly returns False, so there is no problem there.
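
To illustrate the lookup described above, here is a simplified sketch of the traversal; this is not Salt's actual salt.utils.traverse_dict_and_list implementation, just the behavior that matters here.

def traverse_dict_and_list(data, key, default, delimiter=':'):
    # Walk nested dicts by a delimited key; fall back to the default as soon
    # as any step of the key is missing.
    ptr = data
    for step in key.split(delimiter):
        if isinstance(ptr, dict) and step in ptr:
            ptr = ptr[step]
        else:
            return default
    return ptr

# With data.p missing, the cached grains dict is empty, so the lookup returns
# the {} default and _match() correctly reports no match.
print(traverse_dict_and_list({}, 'roles:elasticsearch', {}))  # -> {}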

I guess the problem is within https://github.com/saltstack/salt/blob/2fc36a31a08c90386538acdf39a1ef357fc63f72/salt/daemons/masterapi.py#L482

Sadly, I don't have the resources to debug the daemon.

iggy commented 10 years ago

We are seeing the same thing here. 2014.1.10. We have mine_functions set as well and we have a single master. One of the symptoms that we see when this starts happening is that "salt-call mine.get 'roles:reg' network.get_hostname grain" and "salt -G 'roles:reg' mine.get network.get_hostname" return 2 different lists. We are also on GCE and we may see general network disruptions around the same time that the minions go bonkers.

basepi commented 10 years ago

Wow, thanks for the great updates, @deuscapturus; they will definitely be useful. I'm really not sure why the minion caches are disappearing at times, but we'll definitely investigate this.

deuscapturus commented 10 years ago

68% of my minions are missing data.p after 24 hours. I had debug logging turned on on the master during that 24-hour period and found nothing to explain this. Some of the minions with deleted data.p are configured with multi-master and some with a single master. I ran salt '*' saltutil.sync_all to fix it again.

This is a serious problem. I hope it's getting the attention it deserves.

deuscapturus commented 10 years ago

@iggy, Could you run 'ls /var/cache/salt/master/minions/*' from your master and tell me if any minions are missing the data.p cache file?

iggy commented 10 years ago

Not presently, but the next time we notice issues, I will make sure I (or one of the other guys) check.

cachedout commented 10 years ago

Hrm, I don't like this at all. Upgrading to high severity.

cachedout commented 10 years ago

Is salt-ssh in the mix here anywhere? In researching this issue, I discovered a bug wherein the use of salt-ssh may cause this exact issue.

deuscapturus commented 10 years ago

I've never used salt-ssh.

cachedout commented 10 years ago

OK, thanks. We'll keep digging.

iggy commented 10 years ago

We don't use salt-ssh either.

thatch45 commented 10 years ago

I know that the minion data cache is cleaned out when keys are deleted, so I am looking there, among other places

thatch45 commented 10 years ago

@deuscapturus @iggy: I think that it is in the check_minion_cache method of the Key class in salt/key.py. This cleans out the minion data caches for minions whose keys have been deleted. It would make sense that this is the issue if you are seeing it after deleting keys. I believe this is the only place this check happens. I am still searching to see if I can reproduce more of this.
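
A rough, hypothetical illustration of the cleanup being described, not the real Key.check_minion_cache() from salt/key.py; the directory paths are typical defaults.

import os
import shutil

def clean_stale_minion_cache(pki_dir='/etc/salt/pki/master/minions',
                             cache_dir='/var/cache/salt/master/minions'):
    # Remove cached data (including data.p) for any minion whose accepted key
    # no longer exists, e.g. after `salt-run manage.down removekeys=True`.
    accepted = set(os.listdir(pki_dir)) if os.path.isdir(pki_dir) else set()
    for minion_id in os.listdir(cache_dir):
        if minion_id not in accepted:
            shutil.rmtree(os.path.join(cache_dir, minion_id))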

thatch45 commented 10 years ago

Can you share your master config files with us, please, @deuscapturus @iggy?

thatch45 commented 10 years ago

@deuscapturus @iggy this PR that I merged into 2014.7 gates the call that cleans up the data cache when keys are deleted behind a config option. Can you please apply this patch to your master, or gate the same section of salt/key.py yourself, and let us know if it helps resolve the problem? https://github.com/saltstack/salt/pull/16358

iggy commented 10 years ago

If I'm understanding this correctly, this doesn't actually fix what's wrong; it just wallpapers over it. I'm not even sure if it can be reasonably fixed. I'm guessing this ties into some other issues we've seen in the past, e.g. "salt '*' test.ping" only returning on some instances, having to run highstate a few times in a row, etc. I wonder if RAET might help us mitigate some of this, or if it's just something we'll have to live with in GCE.

thatch45 commented 10 years ago

This patch is primarily intended to help us narrow down the underlying cause of this problem. Your explanation in this post also leads me to believe that your issues stem more from minions simply not receiving commands and less from the minion data cache issues. All in all, this issue just requires some deeper digging to find the real problem.

deuscapturus commented 10 years ago

Note, I am running this cron job nightly:

# Remove salt keys from unavailable minions every day.
1 0 * * * salt-run manage.down removekeys=True

I'll look into the patch next

cachedout commented 10 years ago

@deuscapturus Any news on a test with the patch? (I know it was the weekend, but I'm just making my rounds on the issue tracker so I figured I'd ask.) Thanks!

iggy commented 10 years ago

Fwiw, we applied this patch on Friday and haven't seen any issues since then. That said, I think it's too early to say conclusively (at least for us, we could go days without seeing the issue, then bam).

deuscapturus commented 10 years ago

I've applied the patch to my master running the 2014.1.10-4 release. I'll know if it worked over the next couple days.

I've also noticed that (in AWS) salt-run manage.down removekeys=True will delete the keys of minions that have been idle for several hours. I use auto_accept: True in my master config, so my minions reconnect without any issue, but I believe they do not regenerate their cache until I manually run a sync.

I think the fix will solve this for me. But we still need to fix the false positive match from mine.get when the minion cache is missing.

Thank you all for your urgency in resolving this issue.

cachedout commented 10 years ago

@deuscapturus Thanks for the update!

cachedout commented 10 years ago

OK, here is what I think the problem boils down to. I think this issue is actually in the matcher, and well outside of the mine code itself.

I think that what you're seeing is a condition in which the grains matcher is getting confused under the following conditions:

1) A minion key exists in the PKI dir
2) The minion data cache is enabled
3) No data.p cache has yet been generated

In that case, the master doesn't yet know what the minion's grains are, but it assumes that it will get them at some point in the future. Therefore, it matches the grain even though it doesn't actually have proof of its existence. Normally, this extra match isn't much of a problem, because additional verification is performed on the minion, and all that really happens on the master end is that it might wait a little longer for a return from a minion that will never reply.

However, in the case of the mine, things are different since the match will end up being used blindly by the master to return mine data.

I think what we may need to do here is to add some flags to the matchers that can be used by the mine (or any other master-only) operations that tell them to be more conservative in their matching.

Before we do, however, I wanted to run this past you and see if this reasoning matches (so to speak!) your experience. :]

UPDATE: Here's my area of concern: https://github.com/saltstack/salt/blob/2014.1/salt/utils/minions.py#L175
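
To make the greedy-versus-conservative distinction concrete, here is a simplified, hypothetical sketch; grain_match and cached_grains are illustrative names, and this is not Salt's actual matcher from salt/utils/minions.py.

def grain_match(cached_grains, grain, value, greedy=True):
    # cached_grains is None when the minion has no data.p in the master cache.
    if cached_grains is None:
        # A greedy matcher optimistically includes the minion; a conservative
        # (non-greedy) matcher excludes it until real grain data exists.
        return greedy
    return value in cached_grains.get(grain, [])

# Greedy matching (the behavior described above): a minion with a missing
# cache "matches" any grain expression, which is how bogus targets like
# roles:asdfasdfasdf still returned mine data.
print(grain_match(None, 'roles', 'asdfasdfasdf', greedy=True))              # True
print(grain_match(None, 'roles', 'asdfasdfasdf', greedy=False))             # False
print(grain_match({'roles': ['elasticsearch']}, 'roles', 'elasticsearch'))  # True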

thatch45 commented 10 years ago

I agree, this sounds like the right approach.

cachedout commented 10 years ago

See #16446 for a non-greedy grain matcher.

@deuscapturus I think it is unlikely that this will make it into the .1 branch with 2014.7 imminent but if you want to backport it to test your own installation, it should be fairly trivial. Let me know if you have questions.

deuscapturus commented 10 years ago

#16358 has resolved the problem for me. I look forward to seeing https://github.com/saltstack/salt/pull/16446 merged.

Thanks all.

cachedout commented 10 years ago

Thanks for the reply, @deuscapturus. I'll go ahead and close this.

deuscapturus commented 9 years ago

This problem is back. I recently updated to 2015.5.5:

           Salt: 2015.5.5
         Python: 2.6.9 (unknown, Apr  1 2015, 18:16:00)
         Jinja2: 2.7.3
       M2Crypto: 0.21.1
 msgpack-python: 0.4.6
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.11
          ioflo: Not Installed
          PyZMQ: 14.3.1
           RAET: Not Installed
            ZMQ: 3.2.5
           Mako: Not Installed
        Tornado: Not Installed
        timelib: Not Installed
       dateutil: 2.1
# salt 'haproxy-staging' mine.get 'G@no:such and G@grains:exist' network.ip_addrs compound
haproxy-staging:
    ----------
    ps-dev-paring:
        - 192.168.4.18
basepi commented 9 years ago

@deuscapturus Because this issue has been closed for a year, would you mind opening a new one and referring back to this one? Thanks!