saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

local_cache returner fails on missing .minions.p on master of masters #60251

Open onmeac opened 3 years ago

onmeac commented 3 years ago

Description of Issue

In a master-of-masters setup with one or more syndic servers, the .minions.p file is absent from the job results cache directory when Salt commands are started from a syndic. This causes salt-run jobs.lookup_jid <jid> to fail on the master of masters.

Setup

Steps to Reproduce Issue

[1]: .minions.p file created in /var/cache/salt/master/jobs/<some random jid dir>
[2]: .minions.p absent in /var/cache/salt/master/jobs/<some random jid dir>

Both random jid directories will have a .minions.<name of syndic>.p file.

When looking up a jid result, the local_cache returner creates a list containing the path to .minions.p (MINIONS_P) and then extends that list with any .minions.<name of syndic>.p files (SYNDIC_MINIONS_P).

Code from returners/local_cache.py (lines 317-328):

minions_cache = [os.path.join(jid_dir, MINIONS_P)]
minions_cache.extend(
    glob.glob(os.path.join(jid_dir, SYNDIC_MINIONS_P.format('*')))
)
all_minions = set()
for minions_path in minions_cache:
    log.debug('Reading minion list from %s', minions_path)
    try:
        with salt.utils.files.fopen(minions_path, 'rb') as rfh:
            all_minions.update(serial.load(rfh))
    except IOError as exc:
        salt.utils.files.process_read_exception(exc, minions_path)

Because the MINIONS_P file does not exist, fopen raises an IOError, which process_read_exception re-raises as a CommandExecutionError. Example exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/client/mixins.py", line 374, in low
    data['return'] = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/runners/jobs.py", line 128, in lookup_jid
    display_progress=display_progress
  File "/usr/lib/python3.6/site-packages/salt/runners/jobs.py", line 198, in list_job
    job = mminion.returners['{0}.get_load'.format(returner)](jid)
  File "/usr/lib/python3.6/site-packages/salt/returners/local_cache.py", line 328, in get_load
    salt.utils.files.process_read_exception(exc, minions_path)
  File "/usr/lib/python3.6/site-packages/salt/utils/files.py", line 225, in process_read_exception
    raise CommandExecutionError('{0} does not exist'.format(path))
salt.exceptions.CommandExecutionError: /var/cache/salt/master/jobs/ff/29df13854b66d66262bbb9484de6dc180c140489b3e75871f34f0a6e5c957c/.minions.p does not exist

Possible solutions:

process_read_exception takes an optional argument to ignore certain error codes; that might be an option. Alternatively, an if statement could check whether os.path.join(jid_dir, MINIONS_P) exists before trying to read it.
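To make the first idea concrete, here is a minimal standalone sketch of a read loop that tolerates a missing minions file. It is not Salt's code. The constant values are assumed from the file names described above, read_all_minions is a hypothetical name, and json stands in for Salt's own serializer so the snippet runs without Salt installed.

```python
import errno
import glob
import json
import os

# Assumed values, inferred from the file names described in this issue.
MINIONS_P = ".minions.p"
SYNDIC_MINIONS_P = ".minions.{0}.p"


def read_all_minions(jid_dir):
    """Hypothetical helper: merge minion IDs from every minions file present.

    A missing file (ENOENT) is skipped instead of being treated as fatal;
    any other I/O error is still raised.
    """
    minions_cache = [os.path.join(jid_dir, MINIONS_P)]
    minions_cache.extend(
        glob.glob(os.path.join(jid_dir, SYNDIC_MINIONS_P.format("*")))
    )
    all_minions = set()
    for minions_path in minions_cache:
        try:
            with open(minions_path, "rb") as rfh:
                # json is a stand-in for Salt's serializer (assumption).
                all_minions.update(json.load(rfh))
        except OSError as exc:  # IOError is an alias of OSError on Python 3
            if exc.errno != errno.ENOENT:
                raise
    return all_minions
```

With this behaviour, a jid directory that only contains .minions.<name of syndic>.p files would still yield the syndic's minions rather than an exception. Whether silently skipping the missing file is the right behaviour for Salt is a separate design question.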

Versions Report

Salt Version:
           Salt: 3000.9

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.9.7
         pygit2: Not Installed
         Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 15.3.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.9.2009 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1160.24.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.9.2009 Core

waynew commented 3 years ago

@onmeac thanks for the report! Trying to reproduce this issue but so far I'm not having any success. Or I'm having success? :joy:

Basically, I can't seem to make this fail - salt-run jobs.lookup_jid works just fine for me. Always. I'll look into this a bit more tomorrow to see if I can reproduce any sort of thing :+1:

waynew commented 3 years ago

60251.tar.gz

This is what I used to try and reproduce the issue. Requires some manual intervention to accept keys & start syndic process, but... yeah, lookup_jid kept working.

@onmeac do you think that you can use this as a jumping off point to create a full MCVE?

waynew commented 3 years ago

@onmeac I'm going to go ahead and close this issue for now since I can't reproduce. If you're able to put together an MCVE, let me know and I'll be happy to re-open this! (may need to ping on Slack or IRC)

onmeac commented 3 years ago

From our Slack chat yesterday:

https://github.com/saltstack/salt/blob/master/salt/returners/local_cache.py

lines 273-276

if syndic_id is not None:
    minions_path = os.path.join(jid_dir, SYNDIC_MINIONS_P.format(syndic_id))
else:
    minions_path = os.path.join(jid_dir, MINIONS_P)

If job data is received on a master of masters from a downstream syndic, syndic_id is set, so minions_path is never os.path.join(jid_dir, MINIONS_P).

Line 284, with salt.utils.files.fopen(minions_path, "w+b") as wfh:, creates whatever minions_path is at the time. That is os.path.join(jid_dir, SYNDIC_MINIONS_P.format(syndic_id)) if the job data was received from a downstream syndic.

lines 320-321

minions_cache = [os.path.join(jid_dir, MINIONS_P)]
minions_cache.extend(glob.glob(os.path.join(jid_dir, SYNDIC_MINIONS_P.format("*"))))

minions_cache is now a list whose first element points to a file that does not exist, because that file was never created.

Yet with salt.utils.files.fopen(minions_path, "rb") as rfh: (line 326) attempts to open that file. The open fails, and process_read_exception turns the error into the CommandExecutionError.
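The mismatch between the write side and the read side can be shown in isolation. The following is a rough sketch that only mirrors the path logic quoted above. The function names are hypothetical and the two constant values are assumed from the file names in this issue; nothing here imports Salt.

```python
import glob
import os
import tempfile

# Assumed values, inferred from the file names described in this issue.
MINIONS_P = ".minions.p"
SYNDIC_MINIONS_P = ".minions.{0}.p"


def write_path(jid_dir, syndic_id=None):
    """Mirror of the write-side branch: one file per syndic, else .minions.p."""
    if syndic_id is not None:
        return os.path.join(jid_dir, SYNDIC_MINIONS_P.format(syndic_id))
    return os.path.join(jid_dir, MINIONS_P)


def read_paths(jid_dir):
    """Mirror of the read side: .minions.p is always listed first."""
    paths = [os.path.join(jid_dir, MINIONS_P)]
    paths.extend(glob.glob(os.path.join(jid_dir, SYNDIC_MINIONS_P.format("*"))))
    return paths


def missing_read_paths(jid_dir):
    """Return the paths the read side would open that are not on disk."""
    return [path for path in read_paths(jid_dir) if not os.path.exists(path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as jid_dir:
        # Simulate a job whose minion list arrived only from a syndic.
        open(write_path(jid_dir, "syndic1"), "wb").close()
        print(missing_read_paths(jid_dir))
```

When the only writer is a syndic, the list printed at the end contains the .minions.p path, which is exactly the file named in the traceback.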


Today

I could not get an environment up and running using your images, so I used docker.io/saltstack/salt.

60251_onmeac.tar.gz

This is what I did (using podman in root mode, with limited podman/container knowledge):

Start the mom container: podman-compose -p mom_pod -f mom.yml up --build -d

Get the IP of the mom container: podman exec -ti mom ip a | grep -w inet

Set the mom IP as syndic_master in the syndic/syndic file.

Start the syndic container: podman-compose -p syndic_pod -f syndic.yml up --build -d

Wait a bit for things to start.

Run a job from the syndic: podman exec -ti syndic salt \* test.version --async

Copy the jid.

Look up the jid on the mom: podman exec -ti mom salt-run jobs.lookup_jid <jid>

exception:

Exception occurred in runner jobs.lookup_jid: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/salt/returners/local_cache.py", line 308, in get_load
    with salt.utils.files.fopen(minions_path, "rb") as rfh:
  File "/usr/local/lib/python3.7/site-packages/salt/utils/files.py", line 385, in fopen
    f_handle = open(*args, **kwargs)  # pylint: disable=resource-leakage
FileNotFoundError: [Errno 2] No such file or directory: '/var/cache/salt/master/jobs/98/50909d32c4af0440c92c9020e6e3503d86b5aa6cf7f22676ee34c4c08c1cc1/.minions.p'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/salt/client/mixins.py", line 390, in low
    data["return"] = func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/runners/jobs.py", line 140, in lookup_jid
    data = list_job(jid, ext_source=ext_source, display_progress=display_progress)
  File "/usr/local/lib/python3.7/site-packages/salt/runners/jobs.py", line 205, in list_job
    job = mminion.returners["{}.get_load".format(returner)](jid)
  File "/usr/local/lib/python3.7/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/salt/returners/local_cache.py", line 311, in get_load
    salt.utils.files.process_read_exception(exc, minions_path)
  File "/usr/local/lib/python3.7/site-packages/salt/utils/files.py", line 224, in process_read_exception
    raise CommandExecutionError("{} does not exist".format(path))
salt.exceptions.CommandExecutionError: /var/cache/salt/master/jobs/98/50909d32c4af0440c92c9020e6e3503d86b5aa6cf7f22676ee34c4c08c1cc1/.minions.p does not exist

johnl2323 commented 10 months ago

My MOM gets this same stack trace about every minute, spamming the master log file. Any update on this issue? I am running v3006.3

d3zorg commented 4 weeks ago

getting this after upgrade to 3006.9 on syndics and main master