saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

Minion did not return while using compound matching with custom grain. #35439

Closed roflmao closed 1 year ago

roflmao commented 8 years ago

Description of Issue/Question

While using compound matching I don't get a response from the minion, even though it matches the host:

root@salt-001:~# salt -v -C G@facter_role:docker test.ping
Executing job with jid 20160815103816026593
-------------------------------------------

dn-080.X.no:
    Minion did not return. [No response]

-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 0
# of minions that did not return: 1
# of minions with errors: 0
Minions not responding: dn-080.X.no
-------------------------------------------
root@salt-001:~# salt dn-080* test.ping
dn-080.X.no:
    True

-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 1
# of minions that did not return: 0
# of minions with errors: 0
-------------------------------------------

Setup

2016.3.2 with Ubuntu 16.04 master and Ubuntu 12.04 minion.

Versions Report

Master:

root@salt-001:~# salt --versions-report
Salt Version:
           Salt: 2016.3.2

Dependency Versions:
           cffi: 1.5.2
       cherrypy: Not Installed
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: 0.24.0
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.24.0
         Python: 2.7.12 (default, Jul  1 2016, 15:12:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-34-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Minion:

root@dn-080:~# salt-call --versions-report
Salt Version:
           Salt: 2016.3.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.6
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.7.0
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.3 (default, Jun 22 2015, 19:33:41)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 12.04 precise
        machine: x86_64
        release: 3.13.0-53-generic
         system: Linux
        version: Ubuntu 12.04 precise 

cachedout commented 8 years ago

This works fine for me:

mp@silver ...salt/salt/output % sudo salt -C 'G@os:Arch' test.ping    (git)-[v2016.3.2]
silver:
    True
Salt Version:
           Salt: 2016.3.2

Dependency Versions:
           cffi: 1.7.0
       cherrypy: 6.0.0
       dateutil: 2.5.3
          gitdb: 0.6.4
      gitpython: 1.0.0
          ioflo: 1.4.3
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: 1.4.0
       M2Crypto: 0.24.0
           Mako: 1.0.4
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: 1.2.5
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Jun 28 2016, 08:31:05)
   python-gnupg: 2.0.2
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: 0.6.5
          smmap: 0.9.0
        timelib: 0.2.4
        Tornado: 4.4.1
            ZMQ: 4.1.5

System Versions:
           dist:   
        machine: x86_64
        release: 4.6.4-1-ARCH
         system: Linux
        version: Not Installed

roflmao commented 8 years ago

Looks like it only happens with custom grain (facter.py from https://github.com/saltstack/salt-contrib/blob/master/grains/facter.py)

Ch3LL commented 8 years ago

I'm also not able to replicate this even with a custom grain:

#!/usr/bin/env python
def yourfunction():
    # initialize a grains dictionary
    grains = {}
    # some code for logic that sets grains, like:
    grains['yourcustomgrain'] = True
    grains['anothergrain'] = 'somevalue'
    return grains
root@76beeb3d26cc:/testing# salt -v -C G@yourcustomgrain:True test.ping
Executing job with jid 20160915194409460350
-------------------------------------------

76beeb3d26cc:
    True

If you use the customgrain example above do you still see this behavior?
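
For reference, a custom grain module like the one above normally lives under the master's file roots in a _grains directory and has to be synced to the minions before it can be matched on. Assuming the default file roots, something like:

# save the module as /srv/salt/_grains/yourgrain.py, then:
salt '*' saltutil.sync_grains
salt '*' grains.get yourcustomgrain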

roflmao commented 8 years ago

On the node:

root@photol-011:~# salt-call grains.item facter_cassandra_cluster
[WARNING ] /usr/lib/python2.7/dist-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.

local:
    ----------
    facter_cassandra_cluster:
        PhotostreamCassandraLarge

On the master:

root@salt-001:~# salt -C G@facter_cassandra_cluster:PhotostreamCassandraLarge cmd.run 'hostname'
photol-011.X:
    Minion did not return. [No response]

root@salt-001:~# salt photol-011* cmd.run 'hostname'
photol-011.X:
    photol-011

wolfpackmars2 commented 8 years ago

What if you try using grain matching instead of compound matching?

salt -G 'facter_cassandra_cluster:PhotostreamCassandraLarge' test.ping

The problem might also be the lack of quotes in the matching statement.

It appears from the Master that the correct minion is being targeted, just that the minion isn't responding for some reason, or the master isn't receiving the response. Try running minion and master in debug mode so you can see what each is doing in real time. That should give some insight into what's going on.
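
For example, both daemons can be stopped and run in the foreground with debug logging:

# on the master
salt-master -l debug
# on the minion
salt-minion -l debug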

roflmao commented 8 years ago

root@salt-001:~# salt -G 'facter_cassandra_cluster:PhotostreamCassandraLarge' test.ping
photol-011.X:
    Minion did not return. [No response]

-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 0
# of minions that did not return: 1
# of minions with errors: 0
-------------------------------------------
root@salt-001:~# salt photol-011* test.ping
photol-011.X:
    True

-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 1
# of minions that did not return: 0
# of minions with errors: 0
-------------------------------------------

I will return with more debug info.

sbreidba commented 7 years ago

I'm seeing the same issue with a custom grain (one set via grains.set, not a custom module). Happens on both Windows and an Ubuntu client, salt master/minion version 2016.3.4.

gaetanquentin commented 7 years ago

Same issue for me on 2016.11.1. Minion: Red Hat 6.x; salt master: Ubuntu 16.04.

I had to restart the minions for them to work again with compound targeting.

djha-skin commented 7 years ago

Happening for me also.

# salt --versions
Salt Version:
           Salt: 2016.11.3

Dependency Versions:
           cffi: 1.9.1
       cherrypy: unknown
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.17
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-34-generic
         system: Linux
        version: Ubuntu 16.04 xenial

djha-skin commented 7 years ago

@gtmanfred This bug is still a thing; I can reproduce it. Set a custom non-custom-module grain using grains.set and try to compound match against it. It's affecting our environment in a big way.

djha-skin commented 7 years ago

When I try to match with compound targeting, I get this in the salt-minion debug logs:

2017-04-10 22:19:02,827 [salt.minion                                              ][DEBUG   ][19410] compound_match q-gp2-gpmas-2.imovetv.com ? "G@groups:dynamux and G@roles:pack" => "False and False"

And when I match by glob, I get this:

2017-04-10 22:22:23,607 [salt.minion                                              ][DEBUG   ][19410] compound_match: q-gp2-gpmas-2.imovetv.com ? q-gp2-gpmas-2.imovetv.com
2017-04-10 22:22:23,608 [salt.minion                                              ][DEBUG   ][19410] compound_match q-gp2-gpmas-2.imovetv.com ? "q-gp2-gpmas-2.imovetv.com" => "True"
2017-04-10 22:22:23,609 [salt.minion                                              ][INFO    ][19410] User root Executing command test.ping with jid 20170410222223593870

When I run salt, I'm running like this:

salt -l debug  -C 'G@groups:foo and G@roles:bar' test.ping 

Indeed, when I run salt-call --local grains.get foo or salt-call --local grains.get bar I get nothing back.

Those grains are not set on the minions, nor do they show up when I query for grains from the salt master.

Both the salt master and the salt minion report that these grains do not exist on that minion. Yet, when I target based on grains, the minion in question is targeted; the minion detects that those grains aren't set on it, and so it does not return.

gtmanfred commented 7 years ago

Have you run saltutil.refresh_grains on the minions?

Are they available in grains.items?
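
i.e. something like:

salt '*' saltutil.refresh_grains
salt '*' grains.items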

djha-skin commented 7 years ago

No.

I restarted the minion and re-ran the grain targeting. This time the minion I restarted didn't get targeted. I also have minion_local_cache set to false. I can attach logs tomorrow.


djha-skin commented 7 years ago

Some context on this issue.

We suspect that this issue was triggered in our environment when the DNS load-balanced server started returning faulty answers, and for a while several minions could not resolve the DNS entry for the salt master. That's about when this issue started. Most of the minions I visited which responded (but shouldn't have) couldn't resolve the salt master via ping. This activated the bug.

Hope this helps.

gtmanfred commented 7 years ago

I will see if I can replicate this with that information; unfortunately, if minions aren't connecting because DNS is unresponsive, there is not a lot we can do.

I will take another look tomorrow

djha-skin commented 7 years ago

Thanks!

djha-skin commented 7 years ago

I can reproduce this issue. See the attached tarball; unzip it and enter the directory, and running vagrant up inside it should yield an environment with two minions. I took the following screenshot to show how to reproduce:

saltbug.tar.gz

[screenshot showing the reproduction steps]

In summary, whenever grain targeting is used, all minions whose cache is not up to date in the salt master's cache are targeted, regardless of whether the grain in question is actually set.

This was discussed in #37261 as though it were understood, expected behavior when the master's minion cache is not up-to-date.

But this doesn't work in our environment, because sometimes the master's minion cache is cleared, and sometimes minions are simply offline. Their removal from the cache, or their offline status, means they will be erroneously targeted, not just for module runs but (in my case) for mine searches. Minions that have no business getting into an Apache ZooKeeper cluster, for example, get into it because their cache is not up to date.

Any help getting around this bug until it is fixed would be amazing. This is really affecting our deployment of Salt.

Thanks for your time @gtmanfred

djha-skin commented 7 years ago

Some more information:

If I set minion_data_cache to false while the salt master's minions cache folder is empty, this problem persists, even when I run salt '*' saltutil.sync_all on the minions. So we can conclude that the minion_data_cache option is a non-option until this bug is fixed: you cannot opt out of the minions cache with it. All it does is prevent you from syncing the cache, which makes minion targeting permanently disabled or inaccurate, because the cache is used whether or not it is updated. 😢

gtmanfred commented 7 years ago

So, the actual problem is that the master is guessing that these other minions are going to return.

If the grains cache for the minion does not exist, the master makes a guess and assumes the minion will run the commands. This is just a guess, because it has no way to know whether the minion will run or not, but in our opinion it is better to wait for the minion to return than to just skip it.

The minion does not actually run the command, because the minion looks at its own grains and decides, based on the targeting, that it should not run the command.

You cannot set minion_data_cache to False if you want to use grains for targeting and not deal with all the minions timing out. Period.

Daniel

djha-skin commented 7 years ago

@gtmanfred ,

I appreciate the feedback about setting minion_data_cache to false, and how that is not possible when using grain targeting; it is useful to know :) I also understand that salt is doing its best not to break stuff and is doing the safe thing. I really like how the salt team takes great care to do the safe thing, and how you guys go out of your way to always be backwards compatible. Further, I understand this change was made to improve salt.

The problem with the guessing you are talking about really becomes a problem when we talk about the mine. Consider this (simplified) jinja form found in one of our salt states:

{%- set cluster_members = salt['mine.get']('G@foo:bar and G@baz:quxx', 'network_interfaces', expr_form='compound') %}

Before 2016.11, this form worked flawlessly. With the current guessing behavior, it becomes very dangerous, because it puts all sorts of nodes into the cluster that shouldn't be there. Therefore, the change made was a good one, but it is backwards incompatible.

A solution might be to create a config option in the salt master telling salt not to guess, or a keyword argument in mine.get telling salt not to guess while mining, with the keyword argument enabled by default.
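
Until something like that exists, one defensive pattern is to intersect the mine results with an explicit allowlist kept in pillar. A sketch (the cluster:members pillar key and the grains are illustrative; this relies on Jinja's do extension, which Salt's renderer enables):

{%- set allowed = salt['pillar.get']('cluster:members', []) %}
{%- set cluster_members = {} %}
{%- for host, ifaces in salt['mine.get']('G@foo:bar and G@baz:quxx', 'network_interfaces', expr_form='compound').items() %}
{%-   if host in allowed %}
{%-     do cluster_members.update({host: ifaces}) %}
{%-   endif %}
{%- endfor %}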

Thanks for looking into this issue.

gtmanfred commented 7 years ago

If you want to match that way, and actually check the minions for the information instead of relying on the master cache for grains and such, you should check out publish.publish.

This will only return the information on minions that return data.

https://docs.saltstack.com/en/latest/ref/peer.html
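
A minimal sketch of that setup (the allowed function and the target are illustrative): the peer interface has to be enabled in the master config, and publish.publish then runs from a minion, so only minions that are alive and actually match will answer.

# /etc/salt/master
peer:
  .*:
    - network.interfaces

# from a minion (newer releases use tgt_type= instead of expr_form=):
salt-call publish.publish 'G@foo:bar and G@baz:quxx' network.interfaces expr_form='compound'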

gtmanfred commented 7 years ago

Also, you said that this worked differently in versions before 2016.11?

But I am seeing the same behaviour in 2016.3

root@saltmaster:~# salt \* grains.get examplegrain
d-b.d.example.com:
d-a.d.example.com:
    faraway
root@saltmaster:~# salt -C 'G@examplegrain:faraway' test.ping
d-a.d.example.com:
    True
root@saltmaster:~# rm -rf /var/cache/salt/master/minions/
root@saltmaster:~# salt -C 'G@examplegrain:faraway' test.ping
d-a.d.example.com:
    True
d-b.d.example.com:
    Minion did not return. [No response]
root@saltmaster:~# salt --versions-report
Salt Version:
           Salt: 2016.3.6

Dependency Versions:
           cffi: 1.5.2
       cherrypy: Not Installed
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: 0.24.0
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.24.0
         Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-21-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Which previous version did this behave in a different way so I can do a bisect and see where we changed it?

Thanks, Daniel

djha-skin commented 7 years ago

I think you are right. I've read other bugs and they say this behavior started around 2016.3. I only know we moved from 2015.8 and lower and standardized on 2016.11.

Thanks for the tip about publish.publish, but that still does not fix mine.get. For that matter, publish.publish uses targeting, so we've not really solved the problem: whether via publish.publish or mine.get, grain targeting is still unusable. Also, I don't want to give my minions permission to send commands to each other, nor do I want to use publish for something it was not designed for. It was designed to pass commands, not data.

djha-skin commented 7 years ago

I just confirmed that this problem exists regardless of whether or not salt '*' saltutil.sync_all is called, when the cache backend is consul.

Steps to reproduce, using the vagrant environment I attached above:

  1. Start Consul on the salt master:
    mkdir -p /var/lib/consul
    mkdir -p /etc/consul.d
    consul agent -server -bootstrap-expect=1 \
    -data-dir=/var/lib/consul -node=agent-one -bind=172.20.20.10 \
    -datacenter=dc1 -config-dir=/etc/consul.d
  2. Configure consul cache on the salt master:
    cache: consul
    consul.host: 127.0.0.1
    consul.port: 8500
    consul.token: None
    consul.scheme: http
    consul.consistency: default
    consul.dc: dc1
    consul.verify: True
  3. Restart salt master
  4. Run rm -rf /var/cache/salt/master/minions/*
  5. Run salt '*' saltutil.sync_all
  6. To verify that salt did indeed sync minion cache data, run consul kv get -recurse -detailed minions
  7. Run salt -C 'G@examplegrain:faraway' test.ping

djha-skin commented 7 years ago

@gtmanfred should I put the above in a separate ticket? It's almost a different issue, in that the consul backend doesn't work.

djha-skin commented 7 years ago

Copying @Ch3LL

gtmanfred commented 7 years ago

Yes, please make another ticket.

We are currently discussing internally the correct way to change the matching, and are thinking about making the default match negative when grains are missing, instead of positive as it currently is.

djha-skin commented 7 years ago

Yay! thanks for thinking about this, I will make another ticket. 👍

gtmanfred commented 7 years ago

@saltstack/team-core I am looking for feedback on the proposal to change the current behavior of positive matching when the grains cache is missing for a minion.

Right now, if /var/cache/salt/master/minions/ is empty and we target on grains, the master decides that it should positively match all minions.

root@saltmaster:~# salt \* grains.get examplegrain
saltmaster.example.com:
b.example.com:
a.example.com:
    faraway
root@saltmaster:~# salt -G 'examplegrain:faraway' test.ping
a.example.com:
    True
root@saltmaster:~# rm -rf /var/cache/salt/master/minions/
root@saltmaster:~# salt -G 'examplegrain:faraway' test.ping
a.example.com:
    True
saltmaster.example.com:
    Minion did not return. [No response]
b.example.com:
    Minion did not return. [No response]
root@saltmaster:~# salt-run manage.status tgt='examplegrain:faraway' expr_form=grain
down:
    - saltmaster.example.com
    - b.example.com
up:
    - a.example.com
[INFO    ] Runner completed: 20170419121944772803
root@saltmaster:~# salt \* test.ping
a.example.com:
    True
b.example.com:
    True
saltmaster.example.com:
    True

As you can see in the above example, minions that are actually up can show as down: they are targeted, but they never run the commands they are targeted with.

The proposal is to make this a negative match when the grains cache is not found. With that change, the above case would match none of the minions while the caches do not exist (not even a.example.com), until a saltutil.sync_all was run.

Thanks, Daniel
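
In pseudocode, the difference between the current greedy behavior and the proposal looks roughly like this (a minimal sketch over a plain dict, not the actual salt.utils.minions code):

def check_grain_minions(grain, value, cached_grains, greedy=True):
    # cached_grains maps minion id -> grains dict, or None when the
    # master's cache entry for that minion is missing
    matched = []
    for minion, grains in cached_grains.items():
        if grains is None:
            if greedy:
                # current behavior: no cache, so assume a possible match
                # and wait for the minion (which then never returns)
                matched.append(minion)
            # proposed behavior (greedy=False): skip the minion entirely
            continue
        if grains.get(grain) == value:
            matched.append(minion)
    return matched

# after rm -rf /var/cache/salt/master/minions/ every entry is gone:
cache = {'a.example.com': None,
         'b.example.com': None,
         'saltmaster.example.com': None}
check_grain_minions('examplegrain', 'faraway', cache, greedy=True)
# -> all three minions (a.example.com matches minion-side and returns;
#    the other two just time out); with greedy=False -> no minions at all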

cachedout commented 7 years ago

tbh, I'm not super-thrilled about making that the default. I'd accept it as a config option but I'd need more convincing otherwise.

whiteinge commented 7 years ago

👍 for adding this under a config-gate. That also opens the door to putting the current behavior on a deprecation path if we decide to go that route in the future.

djha-skin commented 7 years ago

I really appreciate you guys working on this

gaetanquentin commented 7 years ago

Same problem with 2016.11.3.

Master:

Salt Version:
           Salt: 2016.11.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.5.0
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-64-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Minion:

Salt Version:
           Salt: 2016.11.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.4.1
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.20.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.6.6 (r266:84292, May 22 2015, 08:34:51)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: redhat 6.7 Santiago
        machine: x86_64
        release: 2.6.32-573.8.1.el6.x86_64
         system: Linux
        version: Red Hat Enterprise Linux Server 6.7 Santiago

salt-minion-strace.tar.gz

salt '*' test.ping works.

salt -C 'G@id:XXX' test.ping does not.

The other minions had been running fine for a month, then became unstable. We restart the salt-master every day because of memory leaks which make it hang after a few days.

Attached are strace files of the salt-minion pid on an unstable minion, and of the salt-master, captured while launching: salt -C 'G@id:XXX' test.ping

Restarting all the minions with salt '*' cmd.run_bg 'salt-call --local service.restart salt-minion' fixes the problem, but not completely: one of the old salt-minion processes stays behind.

Stopping the salt-minion with the official init script (service salt-minion stop) won't kill all salt-minion processes. Trying to stop it a second time will say:

The problem occurs on all our Red Hat releases: RHEL 5.3, 5.6, 5.11, 6.0, 6.2, 6.4, 6.6, 6.7.

Regards

gaetanquentin commented 6 years ago

Problem still present with salt master 2017.7.2.

minions are

A salt -C 'G@xx:vv' state.apply stateX may hang ("minion Y did not return") because one of the minions, which is not even involved in the state application, does not respond.

This problem is quite old now and very annoying. We are obliged to restart all the minions before launching a "big" state.apply operation to be sure it won't happen.

voyvodov commented 6 years ago

What about checking which minions are alive before targeting with grains/mine when the cache is empty (as is done in salt.runners.manage.alived)? Obviously, minions which are dead and have no cache on the master cannot respond with up-to-date grains, so there is no sense in targeting them at all. This should reduce the number of false positives while targeting and still provide enough assurance that all the needed minions are targeted (except those which have no cache and are dead for one reason or another).
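
For reference, that runner can be invoked directly on the master:

salt-run manage.alived    # minions that are up according to Salt's presence detection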

djha-skin commented 6 years ago

The whole problem is that minions respond even when they should not be targeted. True, dead minions do not, but you still have to restart the live ones and refresh the cache no matter what you do.

voyvodov commented 6 years ago

I don't think there is a solution for this. If the cache is missing, you have to check all minions for a match; otherwise you can miss minions while targeting. By the way, what is the reason to clean up the whole cache? What if, after cleanup, you run salt '*' saltutil.refresh_grains? That would keep the grains cache up to date and avoid targeting minions which don't match the targeting expression. You'd still need some automation to clean up dead minions from the salt master's keys.

gtmanfred commented 6 years ago

I am marking this as a feature request, but it will need to live behind a config option that would be enabled on the master.

Thanks, Daniel

djha-skin commented 6 years ago

YAY Thanks very much @gtmanfred, I am much obliged.

gtmanfred commented 6 years ago

It is worth noting that the minions do not actually respond or run anything; they are only added to the expected-minions list. When the minion evaluates the job, it just passes on it, because it cannot match the grain in the job.

The minion name is simply returned from the ckminions check_minions function.

https://github.com/saltstack/salt/blob/2017.7/salt/master.py#L1882

And the master times out waiting for the return on the salt event stream, but nothing is ever returned by the minion. What would need to happen is to make the grains-checking function in ckminions not return minions when it can't find the minion cache; then the master would never wait for a return from them, and wouldn't display the timeout.

Daniel

voyvodov commented 6 years ago

But what will happen if we have a brand new minion which hasn't uploaded its grains to the master cache yet? Would this skip a possibly matching minion, or did I misunderstand the change? Also, what will happen if someone clears the cache?

gtmanfred commented 6 years ago

The grains are rendered when the minion starts, and then sent to the master right after it connects, so if the minion has just connected, the master should already have the grains.

All that the change would do is remove minions from the expected-minions list when the minion's grains cache is not present. So if the grains cache is not there but the minion does match, you won't get the output in the salt command. This is why it would need to be a config gate: there is always a possibility that minions could match and return yet not be in the expected minions the salt command waits for.

If someone clears the cache with that option turned on, it is possible they would not get any returns at all; again, this is why it would need to be behind a config option.

bluesliverx commented 6 years ago

@gtmanfred, is it possible to run into this same issue when targeting with pillar instead of grains? I'm looking into ways to handle a situation that sounds very similar to @djhaskin987's, but we target with compound matchers on pillar data instead of grains. The behavior is the same, however: we get more minions than expected because their cache is missing. We'd really like to see this config option as well (if it does indeed apply). btw, hi @djhaskin987, it's been awhile :)

djha-skin commented 6 years ago

It doesn't matter. If a compound match is used, and it's not a globbed hostname expression, it's hitting the minion data cache. After all, it needs to know what pillars are set on the minion.

djha-skin commented 6 years ago

Hahaha, hi @bluesilverx

bluesliverx commented 6 years ago

After digging through a lot of code over the weekend, I have some findings. The only piece I have not figured out yet is how in the world the minion cache gets removed from the master (@djhaskin987 simulated this with the rm -rf /var/cache/salt/master/minions call). Using the (default) localfs driver for the minion data cache should never purge any minion data as far as I can tell. Maybe someone on the thread has insight into what does this.

After the minion data is cleared, the master will include any minions that have accepted keys (checking the pki directories) but do not have a cache folder in the minion data cache. This behavior is switched on or off in some places with a greedy flag (I'm not sure the switch exists in the actual targeting done for an async job; I haven't tracked that down completely yet). This is referenced in #33730 as well, which was closed for inactivity. The flag is not exposed on the salt command line, but is switched on/off internally in the methods used to gather the minions (https://github.com/saltstack/salt/blob/develop/salt/utils/minions.py#L668).

I believe there are three potential solutions here:

  1. Do as the thread has already agreed and expose the greedy flag on the command line. This would let us make salt calls that exclude uncached minions. However, we could miss minions that are not cached, and they may sit dormant until someone cleans them up manually (not ideal).
  2. Write an additional salt runner for the cache that does the same thing the check-minions method does: find all missing minions (those with an accepted key but no minion data cache entry) and then refresh cache data for them (see the sketch after this list). This could be run from a cronjob/schedule to keep everything clean. It isn't great, since it's really just a band-aid on the problem.
  3. Fix the core issue of the minion data cache not being populated, or being cleared out, for a minion. We only see this in our production environment with thousands of nodes; elsewhere we see it very intermittently. I have yet to reproduce it on demand, even by taking down the salt master/pillar providers at certain times. Possibly I need to simulate network issues to reproduce it.
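
A sketch of what the runner in option 2 could look like (hypothetical; the paths are the defaults for a localfs setup, and refresh_missing is not an existing Salt runner):

import os

import salt.client

PKI_DIR = '/etc/salt/pki/master/minions'      # accepted minion keys
CACHE_DIR = '/var/cache/salt/master/minions'  # localfs minion data cache

def refresh_missing():
    '''Refresh grains on minions that have an accepted key but no cache entry.'''
    accepted = set(os.listdir(PKI_DIR))
    cached = set(os.listdir(CACHE_DIR)) if os.path.isdir(CACHE_DIR) else set()
    missing = sorted(accepted - cached)
    if missing:
        client = salt.client.LocalClient()
        # tgt_type='list' on 2017.7+; older releases use expr_form='list'
        client.cmd(missing, 'saltutil.refresh_grains', tgt_type='list')
    return missing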

As I see it, option 3 is the only real solution here, since it addresses the original reason we see problems. The greedy flag is a nice-to-have, but it's really just a workaround for the core issue. Thoughts?

anitakrueger commented 5 years ago

I have reproduced this on a vanilla 2019.2.0 installation and was wondering what the expected behavior is and how to fix it automagically. It is affecting us quite significantly, because we can't just set grains on a minion and then target the minion by those grains in the next step (this is our workflow for adding capacity atm). This is true not only for the compound matcher but for the grains matcher as well.

Steps to reproduce:

  1. Connect a new minion to the salt master.
  2. Set a grain on the minion, for example the roles grain:
    sudo salt 'local-minion-01*' grains.setval roles webserver
    local-minion-01.local:
    ----------
    roles:
        webserver
  3. Get the grain to see if it was set properly:
    sudo salt 'local-minion-01*' grains.get roles
    local-minion-01.local:
    webserver
  4. Target by that same grain:
    sudo salt -G 'roles:webserver' test.ping
    local-minion-01.local:
    Minion did not return. [No response]

So the minion did get targeted, but it is not executing the module function. A saltutil.refresh_grains did not fix it; only a restart of the salt-minion on the specific box fixes the targeting.

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Oct 30 2018, 23:45:53)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

gtmanfred commented 5 years ago

@saltstack/team-core can someone take a look at this?

IIRC there was a bug specific to 2019.2 that caused @anitakrueger's issue, and it is a different error from what was discussed previously in this issue.

anitakrueger commented 5 years ago

Oh, I was wondering if I should open a new GitHub issue for this. I did find this one, though, because I used the compound matcher as well, and in my test above the right minion is targeted but it isn't executing what I send to it. If there is a more programmatic way of fixing this, I'm happy with a workaround, but restarting the salt-minion seems like the "wrong" fix.

waynew commented 5 years ago

But restarting the salt-minion seems like the "wrong" fix.

I heartily agree!

If your issue is actually a different problem we should definitely open a new issue (and maybe mention this one from there).