frebib opened this issue 2 years ago (status: Open)
@frebib Thanks for the report. Can you provide a small set of example states where you're seeing this behavior? Thanks!
Did it do the same in v3004, or is this new behaviour?
@garethgreenaway I'm not sure this is related to any specific SLS configuration, but I will try to dig out a short series of states that execute in our state graph around a point where it freezes. My suspicion is that the considerable use of requisites is the actual cause, given my experience with the state compiler and what the profile output shows. I'd expect that this can be reproduced on any sizeable state graph.
@OrangeDog I'll try backporting c9a6217fb73ccdbf1cfb80befb4c35df3b64af5c to v3004 and see if it reproduces. As it stands I have that option enabled on a handful of v3004 minions with no detriment, but I also noticed zero improvement/change when enabling it, so it's still plausible that it's broken in 3004. I'll report back.
It seems that v3004 is unaffected, so this must be a regression somehow. I am seeing `[ERROR ] Recursive requisite found` when enabling `state_aggregate: true`, even on 3004 (with c9a6217fb73ccdbf1cfb80befb4c35df3b64af5c applied), so I'll try to debug that further and see if it's related, or possibly the cause of this.
Here is a simple reproducer that causes the `Recursive requisite found` error on 3004 with the backport applied:
```
local:
----------
          ID: git
    Function: pkg.installed
      Result: False
     Comment: Recursive requisite found
     Changes:
----------
          ID: /tmp/foo
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: test.sls.git
     Started: 09:14:09.194697
    Duration: 0.006 ms
     Changes:
----------
          ID: less
    Function: pkg.installed
      Result: False
     Comment: One or more requisite failed: test.sls./tmp/foo
     Started: 09:14:09.194833
    Duration: 0.002 ms
     Changes:

Summary for local
------------
Succeeded: 0
Failed:    3
------------
Total states run:     3
Total run time:   0.008 ms
```
```
jgroocock@133m11:~$ cat test.sls
/tmp/foo:
  file.managed:
    - content: This is some content

less:
  pkg.installed:
    - require:
      - file: /tmp/foo

git:
  pkg.installed:
    - require_in:
      - file: /tmp/foo
```
I guess this is either a bug in the implementation of the pkg `mod_aggregate`, or (more likely, IMHO) a fundamental flaw in the `mod_aggregate` system.
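To make that concrete, here's a toy sketch of the merging behaviour (an approximation written purely for illustration, not Salt's actual `pkg.mod_aggregate` code; the chunk dicts are hand-written stand-ins for the compiled lows from `test.sls`):

```python
# Toy sketch only -- NOT Salt's pkg.mod_aggregate. The dicts approximate the
# compiled low chunks for the two pkg states in test.sls.

low_git = {
    "state": "pkg",
    "name": "git",
    "fun": "installed",
    "require_in": [{"file": "/tmp/foo"}],
}
low_less = {
    "state": "pkg",
    "name": "less",
    "fun": "installed",
    "require": [{"file": "/tmp/foo"}],
}


def naive_aggregate(chunks):
    """Merge all pkg.installed chunks into one, unioning their requisites."""
    merged = {"state": "pkg", "fun": "installed", "pkgs": [], "require": [], "require_in": []}
    for chunk in chunks:
        merged["pkgs"].append(chunk["name"])
        merged["require"].extend(chunk.get("require", []))
        merged["require_in"].extend(chunk.get("require_in", []))
    return merged


print(naive_aggregate([low_git, low_less]))
# pkgs: ['git', 'less'], require: [{'file': '/tmp/foo'}], require_in: [{'file': '/tmp/foo'}]
# The single aggregated state now both requires and is required by /tmp/foo: a cycle.
```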
That doesn't seem to cause 3005 to freak out and spin forever though, so it's possibly not the cause of this particular bug.
Hi @frebib, been looking at this. I think the `test.sls` you gave is expected to fail due to recursive requisites, since your `pkg.installed` states will be aggregated to both require `/tmp/foo` and be required in `/tmp/foo`, if I'm understanding correctly.

As for your original report, were you able to gather a small collection of states that reproduces the issue? We are hoping to re-tag with whatever fixes are required for this by tomorrow.
> due to recursive requisites since your `pkg.installed` states will be aggregated to both require `/tmp/foo` and be required in `/tmp/foo`
So this is precisely the problem that my example is trying to highlight. Without aggregating these states, the produced state graph is a correct DAG: `pkg(git) -> file(/tmp/foo) -> pkg(less)`. The one thing that aggregating states should never do is modify the DAG such that it is no longer acyclic (https://en.wikipedia.org/wiki/Cycle_graph#Directed_cycle_graph), because that makes it impossible to find a start node (one with no incoming edges, https://en.wikipedia.org/wiki/Directed_acyclic_graph#Topological_sorting_and_recognition) to begin traversal/execution.

I'm speculating that this phenomenon/fundamental flaw in the aggregation system is possibly the cause of the issue described in this bug report, due to cycles in the state graph that the compiler doesn't catch. My example above isn't complex enough to trigger this issue because of the trivial size of the state graph, but a considerably larger one, say ~3500 states with tens of thousands of requisites (and lots of pkg states), may be complex enough to get the state compiler stuck either in an infinite loop or in one that won't complete in any reasonable amount of human time. This is purely speculation, but given the (at this point) very well understood shortcomings of the way that Salt handles requisites (#59123), and the captured profile screenshot above showing the CPU time spent in `check_requisite`, I'm suggesting that they are possibly related and caused by the same area of code.
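To illustrate the graph argument, here's a minimal sketch using Kahn's algorithm (purely illustrative; this is not how Salt's compiler actually schedules states):

```python
# Kahn's algorithm: a DAG yields a complete topological order, a cycle does not.
from collections import defaultdict, deque


def topo_order(edges):
    """Return a topological order of the nodes, or None if there is a cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    outgoing = defaultdict(list)
    for src, dst in edges:
        outgoing[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in outgoing[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order if len(order) == len(nodes) else None


# Before aggregation: pkg(git) -> file(/tmp/foo) -> pkg(less) is a valid DAG.
print(topo_order([("pkg(git)", "file(/tmp/foo)"), ("file(/tmp/foo)", "pkg(less)")]))
# After aggregation the two pkg states collapse into one node, creating a cycle,
# so there is no start node (no node with zero incoming edges) left to schedule.
print(topo_order([("pkg(agg)", "file(/tmp/foo)"), ("file(/tmp/foo)", "pkg(agg)")]))  # None
```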
Inspecting the aggregated `high` (what does that even mean?) object for an aggregated pkg state in my highstate shows that it adds many recursive requisites with huge loops (spanning many, many states) right at the time that the highstate locks up.

On the basis that our highstate contains a lot of internal and sensitive/confidential data, it's not feasible to share it, and it would take weeks of work to unpick it.
I can, however, say that I'm pretty sure this should be easy enough to reproduce with `pkg.installed`, `file.managed`, `service.running`, and `cmd.run`, which make up the vast majority of our highstate. The behaviour exhibited around the stalls is identical to that in #59123, which we see normally (in 2019.2.x and 3004.x, without `state_aggregate` enabled), but for much, much longer. Here's an example from the logs (at `-l all`!):
My final thoughts here are that `state_aggregate` doesn't work currently, so I'm going to leave it disabled and carry on.

I was able to spend a bit of time on this one and was able to reproduce the issue with the small example state. At first the exhibited behavior of producing an error about a recursive requisite seemed correct, until I started to dig in more and realized it wasn't actually a recursive error. The issue seems to be the order in which things happen when handling aggregation. Using the small example state, with the following patch the states run in the desired manner:
```diff
diff --git a/salt/state.py b/salt/state.py
index 65cb7a64d7..ce206bea6b 100644
--- a/salt/state.py
+++ b/salt/state.py
@@ -864,7 +864,7 @@ class State:
             # the state function in the low we're looking at
             # and __agg__ is True, add the requisites from the
             # chunk to those in the low.
-            if chunk["state"] == low["state"] and chunk.get("__agg__"):
+            if chunk["state"] == low["state"] and chunk.get("__agg__") and low["name"] != chunk["name"]:
                 for req in frozenset.union(
                     *[STATE_REQUISITE_KEYWORDS, STATE_REQUISITE_IN_KEYWORDS]
                 ):
@@ -891,8 +891,8 @@ class State:
             agg_fun = "{}.mod_aggregate".format(low["state"])
             if agg_fun in self.states:
                 try:
-                    low = self.states[agg_fun](low, chunks, running)
                     low = self._aggregate_requisites(low, chunks)
+                    low = self.states[agg_fun](low, chunks, running)
                     low["__agg__"] = True
                 except TypeError:
                     log.error("Failed to execute aggregate for state %s", low["state"])
```
After this patch, any requisites for states that will be affected by aggregation are calculated first, and then the aggregation for the particular state is run. The patch above also skips the requisite check when the chunk is the same state as the low that was passed in. Because of the aggregation that is happening with the `pkg.installed` state, the `less` package will be installed before the `file.managed` state runs.
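For clarity, here's a rough standalone sketch of the merge rule after the patch (my own simplification for illustration, not a copy of `salt/state.py`; the requisite keyword list is an assumed subset):

```python
# Approximation of the patched requisite-merge rule. The key point is the
# low["name"] != chunk["name"] guard: requisites are only pulled in from
# *other* chunks of the same state module, never from the chunk that is
# itself being aggregated.

REQUISITES = ("require", "require_in", "watch", "watch_in")  # illustrative subset


def aggregate_requisites(low, chunks):
    for chunk in chunks:
        if (
            chunk["state"] == low["state"]
            and chunk.get("__agg__")
            and low["name"] != chunk["name"]  # the guard added by the patch
        ):
            for req in REQUISITES:
                if req in chunk:
                    low.setdefault(req, []).extend(chunk[req])
    return low
```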
I can confirm that the patch above certainly improves the situation, reducing the 64-minute pause after the `ntp.service` state to just 3 minutes. Whilst still not ideal, that's near/within the realm of usability.
Aug 20 10:36:20 131dm3 salt-highstate[2182799]: [INFO ] Completed state [ntp] at time 10:36:20.167489 (duration_in_ms=25.827)
<long (but shorter than before) pause here>
Aug 20 10:39:32 131dm3 salt-highstate[2182799]: [INFO ] Running state [/var/log/chrony] at time 10:39:32.226601
I'll test this some more, and assuming I see no weirdness/brokenness I'll carry this patch on our v3005rc2 branch and re-enable `state_aggregate` on some nodes as a longer test. Thanks for digging into this, @garethgreenaway.

Edit: the above was from a different node type to the one where we see the worst impact. I just checked that one and there is an improvement, but it's not as good: 64 minutes down to 10 minutes.
Aug 20 10:51:12 131m16 salt-highstate[441069]: [INFO ] Completed state [ntp] at time 10:51:12.947702 (duration_in_ms=64.516)
Aug 20 11:01:33 131m16 salt-highstate[441069]: [INFO ] Running state [/var/log/chrony] at time 11:01:33.579171
I'm thinking that the pause here is possibly still just the slow requisite calculation problem, and `mod_aggregate` merely exacerbates it by the way it aggregates requisites. Either way, it's still a big improvement.
Edit again: the highstate is still running (it normally takes a couple of minutes). I think this has just moved the delays to other points in the highstate instead. There was a very lengthy delay much later on, after the couple-of-minutes pause at the `ntp` service state:
Aug 20 10:42:06 131dm3 salt-highstate[2182799]: [INFO ] File /etc/resolv.conf is in the correct state
Aug 20 10:42:06 131dm3 salt-highstate[2182799]: [INFO ] Completed state [/etc/resolv.conf] at time 10:42:06.196400 (duration_in_ms=19.164)
Aug 20 12:21:29 131dm3 salt-highstate[2182799]: [INFO ] Running state [foo] at time 12:21:29.346176
Aug 20 12:21:29 131dm3 salt-highstate[2182799]: [INFO ] Executing state user.present for [foo]
@frebib Thanks for the update, and glad to see there is a bit of a speed-up here. Are you able to share what the `ntp` state looks like? I'm curious what that delay could be doing. Does it rely on packages being installed? If there are a lot of them and they're being aggregated, this could be taking a while to run.
So I tried out the attached PR a few weeks ago (sorry about the silence, I've been busy with other things, including trying to get v3005 to work). In my test the change made no difference to the highstate runtime; it still paused in exactly the same places as before, and for the same lengthy durations. I think the PR is worthwhile, as it does appear to bring some small improvements to the state compiler aside from this issue, but it definitely does not solve the problem described in this issue ticket.
As for our configuration, I have eyed up the `ntp` state after passing it through parts of the state compiler, and I can firmly say that showing you the contents of that SLS file would not help, because most of the requisites are entirely unrelated, being pulled in by the aggregation of requisites from many `pkg` states in many, many other files. There is simply this:

```yaml
# If ntp is running on the machine, kill that service (chrony is meant to replace it).
# Also uninstall the package.
ntp:
  service.dead: []
  pkg.removed: []
```
@frebib Thanks for the feedback, but without the states that are causing the issues for you, reproducing this issue is going to be very difficult.
@frebib Revisiting this one, I'm not sure the fix that was associated with this issue was ultimately correct, and @MKLeb's comment that the recursive requisite result is expected seems right, since both the `require` and `require_in` requisites would end up associated with the package state following the aggregation of the packages. I am curious how you saw the end data ending up in a place where the aggregation would work even with these conflicting requisites.
Description

Highstate on v3005rc2 is marginally slower than v3004, but still within the acceptable range (I measured about ~5% slower). Adding `state_aggregate: true` to the minion conf causes the highstate to freeze for minutes/hours between state calls (logs abbreviated and redacted; note the timestamps). Later on I saw hangs lasting for at least 3 hours, at which point I gave up and killed the job to retrieve the profiling data.
Setup

Our highstate contains ~3.5k states and makes heavy use of various requisites to enforce correctness and ordering, because Salt makes little attempt to do it for us. This is a "standard" minion/master setup, with highstates being invoked with `salt-call state.highstate`. Running with `-l all` yields nothing more useful than `-l debug`.
Steps to Reproduce the behavior

Enable `state_aggregate: true` in a sufficiently large codebase.

Expected behavior

No freezing.
Versions Report (`salt --versions-report`)

```yaml
Salt Version:
          Salt: 3005rc2

Dependency Versions:
          cffi: 1.14.5
      dateutil: 2.8.1
        Jinja2: 2.11.3
      M2Crypto: 0.38.0
       msgpack: 1.0.0
     pycparser: 2.20
  pycryptodome: 3.9.7
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.15.54-cloudflare-2022.7.5
        system: Linux
       version: Debian GNU/Linux 11 bullseye
```

Additional context

I suspect that this is caused by https://github.com/saltstack/salt/commit/8168b25fe5906883a07de5bfdfefabc6d1f57784 and exacerbated by https://github.com/saltstack/salt/issues/59123. A profile of the running minion job process shows an overwhelming majority of CPU time (95%+) stuck in `check_requisite`, which makes me think it's stuck either in an infinite loop or something similar. I'm speculating that this could have always been broken and it's only now apparent since `state_aggregate` was fixed.
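To illustrate the kind of blow-up I'm speculating about, here's a toy model (explicitly not Salt's `check_requisite` implementation; the chunk shape and the linear rescan are assumptions made purely for demonstration):

```python
# Toy model only: shows why resolving every requisite with a linear scan over
# all compiled chunks grows roughly quadratically with the number of states.
import time


def build_chunks(n):
    # a long requisite chain: each state requires its predecessor
    return [{"__id__": f"s{i}", "require": [f"s{i - 1}"] if i else []} for i in range(n)]


def check_all_naive(chunks):
    hits = 0
    for chunk in chunks:
        for req in chunk["require"]:
            # rescan every chunk for every requisite (the pathological pattern)
            hits += sum(1 for other in chunks if other["__id__"] == req)
    return hits


for n in (1000, 2000, 4000):
    chunks = build_chunks(n)
    start = time.perf_counter()
    check_all_naive(chunks)
    print(n, f"{time.perf_counter() - start:.2f}s")  # roughly 4x per doubling of n
```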