saltstack / salt


complex nodegroups results in false "Minion did not return errors" #65523

Open matthewsht opened 10 months ago

matthewsht commented 10 months ago

Description

All, we use nodegroups in our monthly patching cycle, building the final list of hosts to patch from several "not" groups ANDed together. These groups are in turn grain-based and list-based. The result is that 4 hosts consistently yield "Minion did not return" errors. Those minions should not have received any command at all, so this is a false error. [Sorry - this is hard to describe]

Setup

All affected systems are on 3005.3. All systems are directly connected to salt-master03. Note: an upgrade to 3006 is scheduled, but we're a government site and can't just push the patch out.

Nodegroups.conf contains relevant lines (other comments and unrelated nodegroups elided):

nodegroups:
  patch-excluded: '' # systems that are not patched on an existing schedule, or are excluded this month
  patch-foundation-q: '( N@backup-servers or L@distro-master,salt-master03 ) and not N@patch-excluded'
  not-hpc-internal: 'G@hpc_internal:False'
  # has bug
  patch-normal: ' N@not-hpc-internal and not N@patch-excluded and not N@patch-foundation-q'
  # does not have bug
  #patch-normal: 'N@not-hpc-internal and not N@patch-excluded'
  # has bug
  #patch-normal: 'not N@patch-foundation and N@not-hpc-internal and not N@patch-excluded'
  backup-servers: 'L@backup-slave,backup-master'

I've tried moving the "backup-servers" nodegroup before patch-normal, but the problem is not order-dependent.


Steps to Reproduce the behavior

This nodegroup setup yields a list of ALL of our systems (not-hpc-internal), that is, everything EXCEPT the hpc-internal ones, and EXCEPT the hosts specifically listed in N@backup-servers and N@patch-foundation-q. That explicit list is distro-master, salt-master03, backup-slave, backup-master.
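To make the intended result concrete, here is a rough sketch of the set arithmetic I expect the nodegroups to perform. It is plain Python, not Salt code. The host `hpc-node1` and the assumption that it is the only minion with `hpc_internal:True` are made up for illustration; the other names come from the config and output in this report.

```python
# Toy inventory. "hpc-node1" is a hypothetical host added for illustration.
all_minions = {
    "system1", "system2", "system3", "hpc-node1",
    "distro-master", "salt-master03", "backup-slave", "backup-master",
}
# Assumption: only hpc-node1 carries the grain hpc_internal:True.
hpc_internal = {"hpc-node1"}

# backup-servers: 'L@backup-slave,backup-master'
backup_servers = {"backup-slave", "backup-master"}
# patch-excluded: '' (empty this month)
patch_excluded = set()
# patch-foundation-q: '( N@backup-servers or L@distro-master,salt-master03 ) and not N@patch-excluded'
patch_foundation_q = (backup_servers | {"distro-master", "salt-master03"}) - patch_excluded
# not-hpc-internal: 'G@hpc_internal:False'
not_hpc_internal = all_minions - hpc_internal
# patch-normal: 'N@not-hpc-internal and not N@patch-excluded and not N@patch-foundation-q'
patch_normal = not_hpc_internal - patch_excluded - patch_foundation_q

print(sorted(patch_normal))
```

With this toy inventory only system1, system2 and system3 remain, so none of the 4 explicitly listed hosts should be targeted at all.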

These 4 hosts yield the error referenced:

salt -N patch-normal test.ping
system1:
    True
system2:
    True
system3:
AND MANY OTHERS
salt-master03:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20231108154033009594
backup-slave:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20231108154033009594
backup-master:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20231108154033009594
distro-master:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20231108154033009594
ERROR: Minions returned with non-zero exit code

This bug report is specifically around why these 4 nodes report this error: everything else is working as intended/desired.

Expected behavior

We expect the command to not generate errors for the 4 systems specifically excluded.

Versions Report

salt --versions-report

```shell
Salt Version:
  Salt: 3005.3

Dependency Versions:
  cffi: 1.14.6
  cherrypy: unknown
  dateutil: 2.8.1
  docker-py: Not Installed
  gitdb: 4.0.10
  gitpython: 3.1.37
  Jinja2: 3.1.0
  libgit2: Not Installed
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.9.8
  pygit2: Not Installed
  Python: 3.9.18 (main, Nov 1 2022, 00:00:00)
  python-gnupg: 0.4.8
  PyYAML: 6.0.1
  PyZMQ: 23.2.0
  smmap: 5.0.1
  timelib: 0.2.4
  Tornado: 4.5.3
  ZMQ: 4.3.4

System Versions:
  dist: centos 9
  locale: utf-8
  machine: x86_64
  release: 5.14.0-370.el9.x86_64
  system: Linux
  version: CentOS Stream 9
```

I can pretty easily add/modify these nodegroups for testing - please let me know.

OrangeDog commented 10 months ago

This looks like a basic consequence of how targeting works.

The master will guess which minions are going to respond by running the targeting against its data caches, but it broadcasts the job details and any minion that thinks it matches will run it and return its result.

You've either found a case where the master's cache is out of date, or hit a limit on how much guessing the master is prepared to do.
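A toy model of that mismatch, in plain Python rather than Salt's actual code: the master's expected set and the set of minions that really return are computed independently, and anything in the first but not the second is reported as not returning. The function name `unreturned` and the contents of both sets are hypothetical, chosen only to mirror the output in the report.

```python
def unreturned(expected, returned):
    """Minions the master expected to hear from but never did."""
    return sorted(set(expected) - set(returned))

# Hypothetical: the master's cache-based guess over-includes two excluded hosts.
master_expected = {"system1", "system2", "salt-master03", "backup-slave"}
# Hypothetical: each minion evaluates the target itself; only real matches run the job.
minion_returns = {"system1", "system2"}

missing = unreturned(master_expected, minion_returns)
for name in missing:
    print(f"{name}:\n    Minion did not return. [No response]")
```

In this picture the excluded hosts never ran anything. They only show up because the master's guess included them.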