saltstack / salt


Sporadic KeyError: 'runner' in reactor #61416

Open jbg-uwaterloo opened 2 years ago

jbg-uwaterloo commented 2 years ago

Description

3002 salt master on RHEL7. We have a reactor that watches for failures and alerts on them via email. The reactor generally works: it has generated 7 valid results and 2 errors today, and on Jan 01 it produced 3 valid results and one error.

Python traceback below:

2022-01-04 08:05:29,651 [salt.utils.reactor:489 ][ERROR   ][2058] Reactor 'email-on-fail' failed to execute runner 'process_minion_data.email_errors'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 468, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 496, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

Setup

Reactor config (email addresses sanitized):

email-on-fail:
  runner.process_minion_data.email_errors:
    - fromaddr: XXXXXXXXXXX
    - toaddrs: XXXXXXXXXXXXX
    - data_str: {{ data|yaml_dquote }}
    - smtp_server: 127.0.0.1
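
The process_minion_data.email_errors runner is a custom module whose source is not included in this report. Purely for context, a hypothetical sketch of a runner receiving the kwargs above (module layout, names, and SMTP handling are illustrative only, not the actual module) could look like:

# _runners/process_minion_data.py -- hypothetical sketch only
import smtplib
from email.message import EmailMessage


def email_errors(fromaddr, toaddrs, data_str, smtp_server):
    """Send the failed-job event data by email; fired by the reactor above."""
    msg = EmailMessage()
    msg["Subject"] = "Salt job failure"
    msg["From"] = fromaddr
    msg["To"] = toaddrs
    msg.set_content(data_str)
    with smtplib.SMTP(smtp_server) as smtp:
        smtp.send_message(msg)
    return True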

There is no general set-up of Salt.

Set-up details: RHEL7 on-prem VM.

Steps to Reproduce the behavior

Expected behavior

No Python errors reported in the salt master log.


Versions Report

Salt Version:
          Salt: 3002.7

Dependency Versions:
          cffi: Not Installed
      cherrypy: unknown
      dateutil: 2.4.2
     docker-py: Not Installed
         gitdb: 0.6.4
     gitpython: 1.0.1
        Jinja2: 2.11.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Aug 13 2020, 07:46:32)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 17.0.0
         smmap: 0.9.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: rhel 7.9 Maipo
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.49.1.el7.x86_64
        system: Linux
       version: Red Hat Enterprise Linux Server 7.9 Maipo

jbg-uwaterloo commented 2 years ago

Similar to https://github.com/saltstack/salt/issues/52961 (salt 3000)

waynew commented 2 years ago

Looking at this, I'm almost positive that the issue, for some reason, is that the reaction_type isn't in self.client_cache - there's code in salt/utils/reactor that removes the item from the cache if it's expired. In my limited testing, I couldn't get it to exhibit this behavior, even with a ridiculously small TTL (2s, vs default 60s). It would delete the runner from the cache & retry.

It's entirely possible, though, that there's something else at play here with the interplay of more than one runner, the time it takes to actually run that runner, etc.

If someone has some spare cycles to try and track that down, that would be .... oh nope, just figured it out!

In salt/utils/reactor.py, run() calls self.populate_client_cache(low), then gets the runner function and calls it with l_fun(*args, **kwargs) (as seen in the stack trace).

With my 2s timer, I added this:

            import time; time.sleep(5)  # sleep past the 2s TTL before firing the runner
            ret = l_fun(*args, **kwargs)

100% of the time I see this error. So it looks like what's happening is that between grabbing the runner and actually making this call, the TTL expires and the cache entry is dropped.

One potential fix would be to retrieve the runner from client_cache and pass that to the runner function instead; it looks like wheel would suffer the same problem. Another option would be to catch the exception and maybe repopulate the cache? I'm not currently familiar with why the cache/TTL is even there, so there may be a better way to handle it. My gut says that if we hit salt/utils/reactor::ReactWrap.run and there's a live function at the start of the call, it should still be usable when it's finally called... but I'm not positive.
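
To make the race concrete, here is a minimal standalone sketch of the failure and of the hold-a-reference idea, using a toy TTL dict in place of salt.utils.cache.CacheDict (illustrative only, not the actual Salt code):

import time


class TTLDict(dict):
    """Toy stand-in for salt.utils.cache.CacheDict: entries vanish after ttl seconds."""

    def __init__(self, ttl):
        super().__init__()
        self._ttl = ttl
        self._stamps = {}

    def __setitem__(self, key, value):
        self._stamps[key] = time.time()
        super().__setitem__(key, value)

    def __getitem__(self, key):
        # Expired entries are dropped on access, so a late lookup raises
        # KeyError, which is exactly what the tracebacks above show.
        if key in self._stamps and time.time() - self._stamps[key] > self._ttl:
            del self._stamps[key]
            super().__delitem__(key)
        return super().__getitem__(key)


cache = TTLDict(ttl=2)
cache["runner"] = object()   # populate_client_cache() puts the client here

client = cache["runner"]     # take the reference while the entry is still live
time.sleep(3)                # reactor work that outlives the TTL
# cache["runner"] would now raise KeyError: 'runner' (the reported bug),
# but the saved `client` reference is still usable.
print(client)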

Anyway - thanks again for the report!

gbunt commented 2 years ago

We're seeing the same thing in v3002.2:

2022-10-11 13:25:08,469 [salt.template    :31  ][PROFILE ][1401253] Time (in seconds) to render '/srv/salt/reactor/sync_dns.sls' using 'yaml' renderer: 0.0008473396301269531
2022-10-11 13:25:08,471 [salt.utils.reactor:489 ][ERROR   ][1401253] Reactor 'sync_dns' failed to execute runner 'state.orchestrate'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 442, in run
    react_fun = self.client_cache[low["state"]].functions.get(
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

On average, I'd say 1% of the reactor calls run into this race condition. In our case, that's mission critical. Given there's no follow-up on this issue, I can only assume it isn't fixed in v3005, though we will upgrade to the latest version soon.

FWIW, I've not been able to reproduce it anymore after enabling retry for the reactor runner:

sync_dns:
  runner.state.orchestrate:
    - retry:
        attempts: 3
        interval: 1
    - args:
      - mods: sync_dns

Looking at the reactor code, I would say that's purely coincidental, but we've not hit the exception anymore after a few thousand test runs.

vkotarov commented 1 year ago

Started seeing this 2 days ago and now I have 900+ log messages, each throwing the following:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

I have several events configured to run orchestrate runners and this happens with all of them. v3004.2 here.

gmarzot commented 1 year ago

Also seen on 3004.2:

[ERROR ] Reactor 'handle_stats' failed to execute runner 'salt.cmd'
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.10/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

Happening all the time.

I also tried the retry approach from above, with no observable effect.

doesitblend commented 1 year ago

@garethgreenaway This appears similar to https://github.com/saltstack/salt/issues/52961. I have additional logs available if required to dig into this further.

wasabi222 commented 3 weeks ago

I was able to fix this using @waynew's suggestion above. Here's a patch:

--- utils/reactor.py    2024-09-30 09:55:06
+++ utils/reactor.py    2024-09-30 09:54:47
@@ -461,6 +461,10 @@
         """
         Wrap RunnerClient for executing :ref:`runner modules <all-salt.runners>`
         """
+        if "runner" not in self.client_cache:
+            log.debug("reactor edge case: re-populating client_cache")
+            low = {"state": "runner"}
+            self.populate_client_cache(low)
         return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))

     def wheel(self, fun, **kwargs):

This should probably be expanded to cover the other client types (wheel, for example) as well; a sketch of the same guard on the wheel wrapper follows below. I would submit a PR, but I still don't understand how Salt's testing stack works, so maybe somebody can run with this.
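
For illustration, the same guard applied to the wheel wrapper might look like the sketch below, assuming the wheel method mirrors the runner method shown in the patch (untested, for discussion only):

    def wheel(self, fun, **kwargs):
        """
        Wrap Wheel for executing :ref:`wheel modules <all-salt.wheel>`
        """
        # Same edge case as the runner wrapper: the cached client may have
        # expired between run() and this call, so repopulate it if missing.
        if "wheel" not in self.client_cache:
            log.debug("reactor edge case: re-populating client_cache")
            self.populate_client_cache({"state": "wheel"})
        return self.pool.fire_async(self.client_cache["wheel"].low, args=(fun, kwargs))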