jbg-uwaterloo opened this issue 2 years ago
Similar to https://github.com/saltstack/salt/issues/52961 (salt 3000)
Looking at this, I'm almost positive that the issue, for some reason, is that the reaction_type isn't in `self.client_cache`: there's code in `salt/utils/reactor` that removes an item from the cache once it's expired. In my limited testing I couldn't get it to exhibit this behavior, even with a ridiculously small TTL (2s vs. the default 60s); it would just delete the runner from the cache and retry.
It's entirely possible, though, that something else is at play here: the interplay of more than one runner, the time it takes to actually run that runner, etc.
If someone has some spare cycles to try and track that down, that would be... oh, nope, just figured it out!
In `salt/utils/reactor.py`, `run()` calls `self.populate_client_cache(low)`, then gets the runner function and calls it with `l_fun(*args, **kwargs)` (seen in the stack trace).
With my 2s TTL, I added this just before the call:

```python
import time; time.sleep(5)
ret = l_fun(*args, **kwargs)
```
100% of the time I see this error. So it looks like what's happening is that between grabbing the runner and actually making this call, the TTL expires.
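The race described above can be reproduced outside Salt with a toy TTL cache. This is a minimal sketch, not Salt's actual `salt.utils.cache` code; the class and names here are illustrative stand-ins, with the `sleep` playing the role of the delay before `l_fun(*args, **kwargs)`:

```python
import time

class TTLCache(dict):
    """Toy stand-in for a TTL cache: entries vanish after ttl seconds."""

    def __init__(self, ttl):
        super().__init__()
        self._ttl = ttl
        self._ts = {}

    def __setitem__(self, key, value):
        self._ts[key] = time.time()
        super().__setitem__(key, value)

    def __getitem__(self, key):
        # Expired entries are deleted on lookup, so the plain dict
        # lookup below raises KeyError for them.
        if key in self._ts and time.time() - self._ts[key] > self._ttl:
            del self._ts[key]
            super().__delitem__(key)
        return super().__getitem__(key)

cache = TTLCache(ttl=0.1)
cache["runner"] = object()     # like populate_client_cache(low)

def l_fun():
    # like ReactWrap.runner: re-indexes the cache at call time
    return cache["runner"]

time.sleep(0.2)                # the delay before the call outlives the TTL
try:
    l_fun()
except KeyError as exc:
    print("KeyError:", exc)    # same failure mode as the reactor traceback
```

The key point is that the cache entry is checked (implicitly, by populating it) at one time and dereferenced at a later time, with nothing holding it alive in between.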
One potential fix would be to retrieve the runner from `client_cache` in `run()` and pass that reference to the runner function instead; it looks like `wheel` would suffer from the same problem. Another option would be to catch the exception and repopulate the cache. I'm not currently familiar with why the cache/TTL exists at all, so there may be a better way to handle it. My gut says that if we hit `salt/utils/reactor::ReactWrap.run` and there's a live function at the start of the call, it should still be usable when it's finally called... but I'm not positive.
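To illustrate the first option (hold a live reference instead of re-indexing the cache later), here is a self-contained sketch. None of this is Salt's real code: `ReactWrapSketch`, its toy cache, and the hard-coded return value are all hypothetical, and the real `runner()` would call `self.pool.fire_async(client.low, ...)` instead of returning a tuple:

```python
import time

class TTLCache(dict):
    """Toy TTL cache: lookups raise KeyError once an entry is older than ttl."""

    def __init__(self, ttl):
        super().__init__()
        self._ttl = ttl
        self._ts = {}

    def __setitem__(self, key, value):
        self._ts[key] = time.time()
        super().__setitem__(key, value)

    def __getitem__(self, key):
        if key in self._ts and time.time() - self._ts[key] > self._ttl:
            del self._ts[key]
            super().__delitem__(key)
        return super().__getitem__(key)

class ReactWrapSketch:
    """Hypothetical rework: run() captures a live client reference and
    passes it down, so the wrapped call never re-indexes client_cache."""

    def __init__(self):
        self.client_cache = TTLCache(ttl=0.1)

    def run(self, low):
        if low["state"] not in self.client_cache:       # populate_client_cache(low)
            self.client_cache[low["state"]] = object()
        client = self.client_cache[low["state"]]        # live reference, held here
        return self.runner(client, "state.orchestrate")

    def runner(self, client, fun, **kwargs):
        time.sleep(0.2)   # the delay that used to let the TTL expire
        # real code would do: self.pool.fire_async(client.low, args=(fun, kwargs))
        return fun, client

fun, client = ReactWrapSketch().run({"state": "runner"})
print(fun)   # no KeyError, even though the cache entry expired mid-call
```

Because `client` is a plain Python reference, TTL eviction of the cache entry no longer matters once `run()` has captured it.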
Anyway - thanks again for the report!
We're seeing the same thing in v3002.2:
```
2022-10-11 13:25:08,469 [salt.template :31 ][PROFILE ][1401253] Time (in seconds) to render '/srv/salt/reactor/sync_dns.sls' using 'yaml' renderer: 0.0008473396301269531
2022-10-11 13:25:08,471 [salt.utils.reactor:489 ][ERROR ][1401253] Reactor 'sync_dns' failed to execute runner 'state.orchestrate'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 442, in run
    react_fun = self.client_cache[low["state"]].functions.get(
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'
```
On average, I'd say about 1% of reactor calls run into this race condition. In our case, that's mission critical. Given there's been no follow-up on this issue, I can only assume it isn't fixed in v3005, though we will upgrade to a recent version soon.
FWIW, I've not been able to reproduce it anymore after enabling retry for the reactor runner:
```yaml
sync_dns:
  runner.state.orchestrate:
    - retry:
        attempts: 3
        interval: 1
    - args:
        - mods: sync_dns
```
Looking at the reactor code, I'd say that's purely coincidental, but we've not hit the exception anymore after a few thousand test runs.
Started seeing this 2 days ago and now I have 900+ log messages, each with the following traceback:
```
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'
```
I have several events configured to run orchestrate runners, and this happens with all of them. v3004.2 here.
Also seen on 3004.2:

```
[ERROR   ] Reactor 'handle_stats' failed to execute runner 'salt.cmd'
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.10/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'
```
This is happening all the time. I also tried the retry approach from above, with no observable effect.
@garethgreenaway This appears similar to https://github.com/saltstack/salt/issues/52961. I have additional logs available if required to dig into this further.
I was able to fix this using @waynew's suggestion above. Here's a patch:
```diff
--- utils/reactor.py	2024-09-30 09:55:06
+++ utils/reactor.py	2024-09-30 09:54:47
@@ -461,6 +461,10 @@
         """
         Wrap RunnerClient for executing :ref:`runner modules <all-salt.runners>`
         """
+        if "runner" not in self.client_cache:
+            log.debug("reactor edge case: re-populating client_cache")
+            low = {"state": "runner"}
+            self.populate_client_cache(low)
         return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))

     def wheel(self, fun, **kwargs):
```
This should be expanded to include the other file_clients as well. I would submit a PR, but I still don't understand how Salt's testing stack works; maybe somebody can run with this.
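One way to extend the patch to the other client types without repeating the guard in each wrapper is to factor the check into a helper. This is a hypothetical sketch, not Salt code: `get_cached_client` is an invented name, and `populate` stands in for the bound `populate_client_cache` method:

```python
import logging

log = logging.getLogger(__name__)

def get_cached_client(client_cache, populate, state):
    """Return a live client for `state`, repopulating the cache if the
    TTL evicted the entry between run() and the wrapped call.

    `client_cache` is the ReactWrap cache dict; `populate` is a callable
    taking a low dict, like populate_client_cache({"state": state}).
    """
    if state not in client_cache:
        log.debug("reactor edge case: re-populating client_cache for %r", state)
        populate({"state": state})
    return client_cache[state]

# Each wrapper (runner, wheel, local) would then use it the same way, e.g.:
#
# def runner(self, fun, **kwargs):
#     client = get_cached_client(
#         self.client_cache, self.populate_client_cache, "runner"
#     )
#     return self.pool.fire_async(client.low, args=(fun, kwargs))
```

Note this narrows the race window rather than eliminating it: the entry could in principle expire again between the guard and the final lookup, which is why holding a direct reference (as suggested earlier in the thread) is the more airtight fix.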
**Description**
3002 salt master on RHEL7. We have a reactor watching for failures and alerting via email. The reactor generally works and has generated 7 valid alerts and 2 errors today; on Jan 01 it produced 3 valid alerts and one error. Python trace below.

**Setup**
Reactor config (email addresses sanitized). RHEL7 on-prem VM.

**Expected behavior**
No Python errors reported in the salt master log.