saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

[BUG] v3005rc2: Highstate slows to a crawl/never terminates with state_aggregate enabled #62439

Open frebib opened 2 years ago

frebib commented 2 years ago

Description: Highstate on v3005rc2 is marginally slower than v3004, but still within the acceptable range (I measured roughly 5% slower). Adding state_aggregate: true to the minion config causes the highstate to freeze for minutes/hours between state calls:

[INFO    ] The service foo is already running
[INFO    ] Completed state [foo] at time 11:02:59.918796 (duration_in_ms=109.407)
[DEBUG   ] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
[INFO    ] Running state [bar.service] at time 11:06:11.907213
[DEBUG   ] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
[INFO    ] Running state [/etc/bar/baz.conf] at time 11:07:52.187364

(logs abbreviated and redacted.) Note the timestamps. Later on I saw hangs lasting for at least 3 hours at which point I gave up and killed the job to retrieve the profiling data.

Setup: Our highstate contains ~3.5k states and makes heavy use of various requisites to enforce correctness and ordering, because Salt makes little attempt to do it for us. This is a "standard" minion/master setup with highstates being invoked with salt-call state.highstate. Running with -l all yields nothing more useful than -l debug.

Steps to Reproduce the behavior: Enable state_aggregate: true in a sufficiently large codebase.

Expected behavior: No freezing.

Versions Report

salt --versions-report

```yaml
Salt Version:
          Salt: 3005rc2

Dependency Versions:
          cffi: 1.14.5
      dateutil: 2.8.1
        Jinja2: 2.11.3
      M2Crypto: 0.38.0
       msgpack: 1.0.0
     pycparser: 2.20
  pycryptodome: 3.9.7
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.15.54-cloudflare-2022.7.5
        system: Linux
       version: Debian GNU/Linux 11 bullseye
```

Additional context: I suspect that this is caused by https://github.com/saltstack/salt/commit/8168b25fe5906883a07de5bfdfefabc6d1f57784 and exacerbated by https://github.com/saltstack/salt/issues/59123. A profile of the running minion job process shows the overwhelming majority of CPU time (95%+) spent in check_requisite, which makes me think it's stuck in an infinite loop or something similar.

[profiling screenshot]

I'm speculating that this could have always been broken and it's only now apparent since state_aggregate was fixed.

garethgreenaway commented 2 years ago

@frebib Thanks for the report. Can you provide a small subset of example states where you're seeing this behavior? Thanks!

OrangeDog commented 2 years ago

Did it do the same in v3004, or is this new behaviour?

frebib commented 2 years ago

@garethgreenaway I'm not sure this is related to any specific SLS configuration, but I will try to dig out a short series of states that execute in our state graph around a point where it freezes. My suspicion, given my experience with the state compiler and what the profile output shows, is that the considerable use of requisites is the actual cause. I'd expect that this can be reproduced on any sizeable state graph.
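
To make the "any sizeable state graph" expectation concrete, a hypothetical generator like the one below could build a synthetic tree of interdependent pkg/file/test states for testing; the paths, counts, state IDs and package name are all invented for illustration and this is not the reporter's actual configuration. Whether scale alone is enough to trigger the stall with state_aggregate: true is exactly what such a script would help check.

```python
# Hypothetical stress-tree generator (not from this report): writes N_GROUPS
# small SLS files, each with a pkg state, a file state requiring it, and a
# test state chaining to the previous group, so the compiled highstate has
# many pkg states and many requisite edges. Paths and names are invented.
import os

OUT_DIR = "srv/salt/stress"   # assumed file_roots layout
N_GROUPS = 500                # yields roughly 1500 states

os.makedirs(OUT_DIR, exist_ok=True)

top = ["base:", "  '*':"]
for i in range(N_GROUPS):
    lines = [
        f"pkg-{i}:",
        "  pkg.installed:",
        "    - name: cowsay        # package name is arbitrary for this test",
        "",
        f"/tmp/stress/conf-{i}:",
        "  file.managed:",
        f"    - contents: 'group {i}'",
        "    - makedirs: True",
        "    - require:",
        f"      - pkg: pkg-{i}",
        "",
        f"check-{i}:",
        "  test.succeed_without_changes:",
        "    - require:",
        f"      - file: /tmp/stress/conf-{i}",
    ]
    if i > 0:  # chain groups together so the requisite graph is connected
        lines.append(f"      - pkg: pkg-{i - 1}")
    with open(os.path.join(OUT_DIR, f"group{i}.sls"), "w") as fh:
        fh.write("\n".join(lines) + "\n")
    top.append(f"    - stress.group{i}")

with open("srv/salt/top.sls", "w") as fh:
    fh.write("\n".join(top) + "\n")
```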

@OrangeDog I'll try backporting c9a6217fb73ccdbf1cfb80befb4c35df3b64af5c to v3004 and see if it reproduces. As it stands, I have that option enabled on a handful of v3004 minions with no detriment, but I also noticed zero improvement/change when enabling it, so it's still plausible that it's broken in 3004. I'll report back.

frebib commented 2 years ago

It seems that v3004 is unaffected, so this must be a regression somehow. I am seeing [ERROR ] Recursive requisite found when enabling state_aggregate: true, even on 3004 (with c9a6217fb73ccdbf1cfb80befb4c35df3b64af5c applied), so I'll try to debug that further and see if it's related, or possibly the cause of this.

frebib commented 2 years ago

Here is a simple reproducer that causes the Recursive requisite found error on 3004 with the backport applied:

local:
----------
          ID: git
    Function: pkg.installed
      Result: False
     Comment: Recursive requisite found
     Changes:
----------
          ID: /tmp/foo
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: test.sls.git
     Started: 09:14:09.194697
    Duration: 0.006 ms
     Changes:
----------
          ID: less
    Function: pkg.installed
      Result: False
     Comment: One or more requisite failed: test.sls./tmp/foo
     Started: 09:14:09.194833
    Duration: 0.002 ms
     Changes:

Summary for local
------------
Succeeded: 0
Failed:    3
------------
Total states run:     3
Total run time:   0.008 ms
jgroocock@133m11:~$ cat test.sls
/tmp/foo:
  file.managed:
    - content: This is some content

less:
  pkg.installed:
    - require:
      - file: /tmp/foo

git:
  pkg.installed:
    - require_in:
      - file: /tmp/foo

I guess this is either a bug in the implementation of the pkg mod_aggregate or (more likely IMHO) a fundamental flaw in the mod_aggregate system.

That doesn't seem to cause 3005 to freak out and spin forever though, so it's possibly not the cause of this particular bug.

MKLeb commented 2 years ago

Hi @frebib, been looking at this. I think the test.sls you gave is expected to fail due to recursive requisites since your pkg.installed states will be aggregated to both require /tmp/foo and be required in /tmp/foo if I'm understanding correctly.

As for your original report, were you able to gather a small collection of states that reproduces the issue? We are hoping to re-tag with whatever fixes are required for this by tomorrow.

frebib commented 2 years ago

> due to recursive requisites since your pkg.installed states will be aggregated to both require /tmp/foo and be required in /tmp/foo

So this is precisely the problem that my example is trying to highlight. Without aggregating these states, the produced state graph is a correct DAG: pkg(git) -> file(/tmp/foo) -> pkg(less). The one thing that aggregating states should never do is modify the DAG such that it is no longer acyclic (https://en.wikipedia.org/wiki/Cycle_graph#Directed_cycle_graph), because the resulting cycle makes it impossible to find a start node (one with no incoming edges, https://en.wikipedia.org/wiki/Directed_acyclic_graph#Topological_sorting_and_recognition) from which to begin traversal/execution.
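
To make that concrete, here is a small standalone sketch (plain Python, not Salt code; the edges are written out by hand from the test.sls above) showing that the original graph topologically sorts fine, while collapsing the two pkg states into a single aggregated node leaves a cycle with no valid ordering:

```python
# Standalone illustration (not Salt code): requisite edges from test.sls above,
# before and after the two pkg states are (hypothetically) aggregated into one node.
from collections import defaultdict, deque

def topo_order(edges):
    """Kahn's algorithm: return a valid ordering, or None if the graph has a cycle."""
    nodes = {n for edge in edges for n in edge}
    indeg = defaultdict(int)
    out = defaultdict(list)
    for src, dst in edges:          # src must run before dst
        out[src].append(dst)
        indeg[dst] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in out[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order if len(order) == len(nodes) else None

# Original graph: pkg(git) -> file(/tmp/foo) -> pkg(less). A valid DAG.
before = [("pkg:git", "file:/tmp/foo"), ("file:/tmp/foo", "pkg:less")]
print(topo_order(before))   # ['pkg:git', 'file:/tmp/foo', 'pkg:less']

# After aggregation, git and less collapse into one pkg node that both
# requires and is required by /tmp/foo, so the graph contains a cycle.
after = [("pkg:git+less", "file:/tmp/foo"), ("file:/tmp/foo", "pkg:git+less")]
print(topo_order(after))    # None: no node without incoming edges remains
```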

I'm speculating that this phenomenon/fundamental flaw in the aggregation system is possibly the cause of the issue described in this bug report, due to the cycles in the state graph that the compiler doesn't catch. My example above isn't complex enough to trigger this issue because of the trivial size of the state graph, but one of considerably greater size, say ~3500 states with tens of thousands of requisites (and lots of pkg states), may be complex enough to cause the state compiler to get stuck either in an infinite loop, or in one that won't complete in any reasonable amount of human time. This is purely speculation, but given the (at this point) very well understood shortcomings of the way that Salt handles requisites (#59123) and the captured profile screenshot above showing the CPU time spent in check_requisite, I'm suggesting that they are possibly related and caused by the same area of code.

Inspecting the aggregated high (what does that even mean?) object for an aggregated pkg state in my highstate shows that it adds many recursive requisites with huge loops (spanning many, many states) right at the time the highstate locks up. Because our highstate contains a lot of internal and sensitive/confidential data, it's not feasible to share it, and it would take weeks of work to unpick. I can however say that I'm pretty sure this should be easy enough to reproduce with:

The behaviour exhibited around the stalls is identical to that in #59123, which we see normally (in 2019.2.x, 3004.x, without state_aggregate enabled), but for much, much longer. Here's an example from the logs (at -l all!):

Expand for some not very interesting logs:

10:20:47,861 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,892 [DEBUG ][743293] stdout: * auditd.service - Security Auditing Service Loaded: loaded (/etc/systemd/system/auditd.service; enabled; vendor preset: enabled)
10:20:47,894 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,923 [DEBUG ][743293] stdout: active
10:20:47,925 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,951 [DEBUG ][743293] stdout: enabled
10:20:47,951 [INFO ][743293] The service auditd is already running
10:20:47,952 [INFO ][743293] Completed state [auditd] at time 10:20:47.952219 (duration_in_ms=94.282)
!! Note the ~7 minute pause here
10:28:03,221 [DEBUG ][743293] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
10:28:03,222 [INFO ][743293] Running state [redacted.service] at time 10:28:03.221996
10:28:03,222 [INFO ][743293] Executing state service.dead for [redacted.service]
10:28:03,224 [INFO ][743293] Executing command systemctl in directory '/root'
10:28:03,253 [DEBUG ][743293] stdout: Unit redacted.service could not be found.
10:28:03,254 [DEBUG ][743293] retcode: 4
10:28:03,254 [INFO ][743293] The named service redacted.service is not available
10:28:03,255 [INFO ][743293] Completed state [redacted.service] at time 10:28:03.255152 (duration_in_ms=33.155)
!! Note the ~4 minute pause here
10:32:02,744 [DEBUG ][743293] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
10:32:02,745 [INFO ][743293] Running state [/etc/audit/plugins.d/redacted.conf] at time 10:32:02.745108
10:32:04,284 [INFO ][743293] The service systemd-timesyncd is already dead
10:32:04,285 [INFO ][743293] Completed state [systemd-timesyncd] at time 10:32:04.285189 (duration_in_ms=93.596)
10:32:04,297 [DEBUG ][743293] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
10:32:04,297 [INFO ][743293] Running state [ntp] at time 10:32:04.297888
10:32:04,298 [INFO ][743293] Executing state service.dead for [ntp]
10:32:04,299 [INFO ][743293] Executing command systemctl in directory '/root'
10:32:04,322 [DEBUG ][743293] stdout: Unit ntp.service could not be found.
10:32:04,323 [DEBUG ][743293] retcode: 4
10:32:04,323 [INFO ][743293] The named service ntp is not available
10:32:04,324 [INFO ][743293] Completed state [ntp] at time 10:32:04.324223 (duration_in_ms=26.333)
!! Note ~64 minute(!!!!) pause here
11:36:10,200 [salt.utils.lazy :102 ][DEBUG ][743293] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
11:36:10,201 [salt.state :2159][INFO ][743293] Running state [/var/log/chrony] at time 11:36:10.201216
11:36:10,202 [salt.state :2190][INFO ][743293] Executing state file.absent for [/var/log/chrony]

My final thoughts here are that

garethgreenaway commented 2 years ago

I was able to spend a bit of time on this one and was able to reproduce the issue with the small example state. At first the exhibited behavior of producing an error about a recursive requisite seemed correct, until I started to dig in more and realized it wasn't actually a recursive error. The issue seems to be the order in which things happen when handling aggregation; with the small example state and the following patch, the states run in the desired manner:

diff --git a/salt/state.py b/salt/state.py
index 65cb7a64d7..ce206bea6b 100644
--- a/salt/state.py
+++ b/salt/state.py
@@ -864,7 +864,7 @@ class State:
             # the state function in the low we're looking at
             # and __agg__ is True, add the requisites from the
             # chunk to those in the low.
-            if chunk["state"] == low["state"] and chunk.get("__agg__"):
+            if chunk["state"] == low["state"] and chunk.get("__agg__") and low["name"] != chunk["name"]:
                 for req in frozenset.union(
                     *[STATE_REQUISITE_KEYWORDS, STATE_REQUISITE_IN_KEYWORDS]
                 ):
@@ -891,8 +891,8 @@ class State:
             agg_fun = "{}.mod_aggregate".format(low["state"])
             if agg_fun in self.states:
                 try:
-                    low = self.states[agg_fun](low, chunks, running)
                     low = self._aggregate_requisites(low, chunks)
+                    low = self.states[agg_fun](low, chunks, running)
                     low["__agg__"] = True
                 except TypeError:
                     log.error("Failed to execute aggregate for state %s", low["state"])

After this patch, any requisites for states that will be affected by aggregation will be calculated first, and then the aggregation for the particular state is run. The patch above also skips the requisite merge when the chunk is the same state as the low that was passed in. Because of the aggregation that is happening with the pkg.installed state, the less package will be installed before the file.managed state runs.
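
To spell out what the new guard does in isolation, here is a simplified standalone paraphrase (not the actual salt/state.py code; only a subset of requisite keywords is shown, and the example states are invented):

```python
# Simplified, standalone paraphrase of the patched guard (not real salt/state.py):
# when folding requisites from other chunks of the same state module into the
# current low, skip the chunk that is the low itself.
REQUISITES = ("require", "require_in", "watch", "watch_in")  # subset, for illustration

def aggregate_requisites(low, chunks):
    merged = {key: list(low.get(key, [])) for key in REQUISITES}
    for chunk in chunks:
        same_module = chunk["state"] == low["state"]
        aggregated = chunk.get("__agg__", False)
        same_chunk = chunk["name"] == low["name"]   # the guard added by the patch
        if same_module and aggregated and not same_chunk:
            for key in REQUISITES:
                merged[key].extend(chunk.get(key, []))
    result = dict(low)
    result.update({k: v for k, v in merged.items() if v})
    return result

# Invented example: 'vim' picks up 'curl's watch, but does not re-add its own
# require from itself, thanks to the name check.
vim = {"state": "pkg", "name": "vim", "__agg__": True,
       "require": [{"file": "/etc/vimrc"}]}
curl = {"state": "pkg", "name": "curl", "__agg__": True,
        "watch": [{"file": "/etc/curlrc"}]}
print(aggregate_requisites(vim, [vim, curl]))
```

This only illustrates the mechanical effect of the low["name"] != chunk["name"] check, namely that a chunk never folds requisites back from itself; whether that is the right fix overall is what the rest of this thread goes on to question.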

frebib commented 2 years ago

I can confirm that the patch above certainly improves the situation, reducing the 64-minute pause after the ntp.service state to just 3 minutes. Whilst still not ideal, that's near or within the realm of usability.

Aug 20 10:36:20 131dm3 salt-highstate[2182799]: [INFO    ] Completed state [ntp] at time 10:36:20.167489 (duration_in_ms=25.827)
<long (but shorter than before) pause here>
Aug 20 10:39:32 131dm3 salt-highstate[2182799]: [INFO    ] Running state [/var/log/chrony] at time 10:39:32.226601

I'll test this some more and, assuming I see no weirdness/brokenness, I'll carry this patch on our v3005rc2 branch and re-enable state_aggregate on some nodes as a longer test. Thanks for digging into this @garethgreenaway

Edit: The above was from a different node type to the one where we see the worst impact. I just checked that one, and there is an improvement, but it's not as good: 64 minutes down to 10 minutes.

Aug 20 10:51:12 131m16 salt-highstate[441069]: [INFO    ] Completed state [ntp] at time 10:51:12.947702 (duration_in_ms=64.516)
Aug 20 11:01:33 131m16 salt-highstate[441069]: [INFO    ] Running state [/var/log/chrony] at time 11:01:33.579171

I'm thinking that the pause here is possibly still just the slow requisite calculation problem, and mod_aggregate just exacerbates it by the way it aggregates requisites. Either way, it's still a big improvement.

Edit again: The highstate is still running (it normally takes a couple of minutes). I think this has just moved the delays to other points in the highstate instead. There was a very lengthy delay much later on, after the few-minute pause at the ntp service state:

Aug 20 10:42:06 131dm3 salt-highstate[2182799]: [INFO    ] File /etc/resolv.conf is in the correct state
Aug 20 10:42:06 131dm3 salt-highstate[2182799]: [INFO    ] Completed state [/etc/resolv.conf] at time 10:42:06.196400 (duration_in_ms=19.164)
Aug 20 12:21:29 131dm3 salt-highstate[2182799]: [INFO    ] Running state [foo] at time 12:21:29.346176
Aug 20 12:21:29 131dm3 salt-highstate[2182799]: [INFO    ] Executing state user.present for [foo]

garethgreenaway commented 2 years ago

@frebib Thanks for the update; glad to see there is a bit of a speed-up here. Are you able to share what the ntp state looks like? I'm curious what that delay could be doing. Does it rely on packages being installed? If there are a lot of them and they're being aggregated, this could be taking a while to run.

frebib commented 2 years ago

So I tried out the attached PR a few weeks ago (sorry about the silence, I've been busy with other things, including trying to get v3005 to work). In my test the change made no difference to the highstate runtime, and it still paused in exactly the same places as before, and for the same lengthy durations. I think the PR is worthwhile, as it does appear to bring some small improvements to the state compiler aside from this issue, but it definitely does not solve the problem described in this issue ticket.

As for our configuration, I have eyed up the ntp state after passing it through parts of the state compiler, and I can firmly say that showing you the contents of that SLS file would not help, because most of the requisites are entirely unrelated, being pulled in by the aggregation of requisites from many pkg states in many, many other files. There is simply this:

# If ntp is running on the machine, kill that service (chrony is meant to replace it).
# Also uninstall the package
ntp:
  service.dead: []
  pkg.removed: []

frebib commented 2 years ago

#62529 does not fix this. Can it be re-opened?

garethgreenaway commented 2 years ago

@frebib Thanks for the feedback, but without the states that are causing the issues for you, reproducing this issue is going to be very difficult.

garethgreenaway commented 1 year ago

@frebib Revisiting this one: I'm not sure the fix that was associated with this issue was ultimately correct, and I think @MKLeb's comment that the recursive requisite result is in fact correct stands, as both the require and require_in requisites would end up associated with the package state following the aggregation of the packages. I am curious how you saw the end data ending up in a place where the aggregation would work even with these conflicting requisites.