saltstack / salt


[BUG] Race condition in master fileserver file_list cache rebuild #61407

Open duckfez opened 2 years ago

duckfez commented 2 years ago

Description

Executing a state on one or more minions causes the file list cache on the master to be rebuilt an indeterminate number of times, and the repeated rebuilds drive high CPU utilization on the master. The time a state run takes to complete is a function of the number of minions running the state, how many files are in the fileserver, and how many times the master regenerates the file list cache.

Setup

The environment is AWS EC2: a t3.large for the master and t3.smalls for the 6 minions. There are no pillars, so pillar rendering is not a factor in the performance numbers.

I have tested two different configurations, one with gitfs+roots backends and one with just the roots backend.

In order to make the file_list rebuild process take a somewhat "long" time (seconds), I have a test git repo filled with about 55K files, which are just multiple copies of the salt source tree. The same 55K files are also copied into /srv/bigjunk on the master.
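For anyone reproducing this, the test files can be generated with something along these lines; the source path and the number of copies are illustrative only, and any tree with tens of thousands of files will do:

```python
# Illustrative helper: fill /srv/bigjunk with ~55K files by copying a local
# checkout of the salt source tree several times over.
import shutil
from pathlib import Path

SRC = Path("/root/salt")          # assumption: a local checkout of the salt repo
DEST = Path("/srv/bigjunk")
DEST.mkdir(parents=True, exist_ok=True)

for i in range(10):               # adjust until the tree holds enough files
    shutil.copytree(SRC, DEST / f"salt-copy-{i:02d}")
```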

master config w/ gitfs:

gitfs_provider: gitpython

gitfs_remotes:
  - git@gitfs:duckfez/gitfs.git:
      - name: gitfs
      - base: main
      - mountpoint: salt://gitfs

fileserver_backend:
  - roots
  - gitfs

file_roots:
  base:
    - /srv/salt

log_level: warning
log_fmt_console: '%(asctime)s,%(msecs)03d [%(name)-17s][%(colorlevel)s] %(message)s'
log_level_logfile: all
log_fmt_logfile: '%(asctime)s,%(msecs)03d [%(process)d][%(name)-17s][%(levelname)-8s] %(message)s'

master config w/ roots only:

fileserver_backend:
  - roots
file_roots:
  base:
    - /srv/salt
    - /srv/bigjunk
log_level: warning
log_fmt_console: '%(asctime)s,%(msecs)03d [%(name)-17s][%(colorlevel)s] %(message)s'
log_level_logfile: all
log_fmt_logfile: '%(asctime)s,%(msecs)03d [%(process)d][%(name)-17s][%(levelname)-8s] %(message)s'

My minimal dummy test state, teststate.sls, is:

teststate:
  test.succeed_without_changes:
    - name: "I am a test"

Steps to Reproduce the behavior

I'm tailing the master logs with this grep to capture specific messages relevant to this:

tail -f /var/log/salt/master | egrep "(Returning file list from cache)|Updating roots fileserver cache|AES payload received with command _file_list|Master function call _file_list|Start rebuilding roots file_list cache"

  1. Start with a list of minions:

    # salt '*' test.ping --out=txt
    master: True
    ip-172-31-0-64.ec2.internal: True
    i-052bc1517d7f5a0f7.ec2.internal: True
    i-0034e65490b9bce51.ec2.internal: True
    i-00206eddd83fe120a.ec2.internal: True
    i-0632a5fe83779f6b5.ec2.internal: True
    i-06d3dd31d057dcb92.ec2.internal: True
  2. Run a state against just one minion and measure the wall time of it:

# time salt 'ip*' state.sls teststate --state-verbose=False --state-output=terse
ip-172-31-0-64.ec2.internal:

Summary for ip-172-31-0-64.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   1.327 ms

real    0m18.120s
user    0m1.188s
sys     0m0.169s

Logs from this:

2021-12-30 19:23:33,075 [332892][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)
2021-12-30 19:23:46,485 [328902][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:23:46,485 [328902][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:24:02,325 [328902][salt.master                              ][TRACE   ] Master function call _file_list took 15.839807271957397 seconds
2021-12-30 19:24:34,525 [332892][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)
2021-12-30 19:25:35,824 [332892][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)
  3. Run the same state against all 6 minions, but do it a few times to catch the race:
# time salt 'i*' state.sls teststate --state-verbose=False --state-output=terse
[ERROR   ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

real    0m47.044s
user    0m1.011s
sys     0m0.132s

Logs:

2021-12-30 19:26:50,361 [328910][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:26:50,362 [328910][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:26:50,442 [328912][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:26:50,454 [328914][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:26:50,571 [328902][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:26:50,646 [328913][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:26:53,445 [328912][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:26:53,466 [328914][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:26:53,574 [328902][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:26:53,656 [328913][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:27:34,991 [328910][salt.master                              ][TRACE   ] Master function call _file_list took 44.618460178375244 seconds
2021-12-30 19:27:35,442 [328910][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:27:35,443 [328910][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=1 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:27:37,595 [328910][salt.master                              ][TRACE   ] Master function call _file_list took 2.1520118713378906 seconds
2021-12-30 19:27:38,613 [332892][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)
2021-12-30 19:27:43,871 [328913][salt.master                              ][TRACE   ] Master function call _file_list took 53.22390127182007 seconds
2021-12-30 19:27:44,370 [328902][salt.master                              ][TRACE   ] Master function call _file_list took 53.798537492752075 seconds
2021-12-30 19:27:44,635 [328912][salt.master                              ][TRACE   ] Master function call _file_list took 54.19292092323303 seconds
2021-12-30 19:27:44,902 [328914][salt.master                              ][TRACE   ] Master function call _file_list took 54.44801068305969 seconds

Notice how the file_list cache was rebuilt 5 times for the 6 requests, and 4 of those 5 rebuilds kicked off essentially concurrently. Where running the state on 1 minion finished in 18 seconds, running it on 6 took 47.

Now if I run the same command twice back-to-back in quick succession, I get a slightly different result: on the second run the cache built during the first run is still valid, none of the worker processes attempt to rebuild it, and the state runs on all 6 minions in about 6 seconds.

(v) [root@i-0292831be31ca19e2 srv]# time salt 'i*' state.sls teststate --state-verbose=False --state-output=terse
[ERROR   ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

real    0m47.156s
user    0m1.068s
sys     0m0.129s
(v) [root@i-0292831be31ca19e2 srv]# time salt 'i*' state.sls teststate --state-verbose=False --state-output=terse
i-052bc1517d7f5a0f7.ec2.internal:

Summary for i-052bc1517d7f5a0f7.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   1.731 ms
ip-172-31-0-64.ec2.internal:

Summary for ip-172-31-0-64.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   1.493 ms
i-0034e65490b9bce51.ec2.internal:

Summary for i-0034e65490b9bce51.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   1.561 ms
i-06d3dd31d057dcb92.ec2.internal:

Summary for i-06d3dd31d057dcb92.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   2.647 ms
i-00206eddd83fe120a.ec2.internal:

Summary for i-00206eddd83fe120a.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   1.812 ms
i-0632a5fe83779f6b5.ec2.internal:

Summary for i-0632a5fe83779f6b5.ec2.internal
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   1.501 ms

real    0m5.736s
user    0m0.993s
sys     0m0.142s

The logs:

2021-12-30 19:29:32,075 [328912][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:29:32,076 [328912][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:29:32,197 [328913][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:29:32,304 [328914][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:29:32,313 [328910][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:29:32,375 [328902][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:29:35,200 [328913][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:29:35,311 [328914][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:29:35,322 [328910][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:29:35,391 [328902][salt.loaded.int.fileserver.roots         ][TRACE   ] Start rebuilding roots file_list cache
2021-12-30 19:29:44,614 [333562][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)
2021-12-30 19:30:19,010 [328912][salt.master                              ][TRACE   ] Master function call _file_list took 46.93488335609436 seconds
2021-12-30 19:30:19,447 [328912][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:19,448 [328912][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=1 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:21,693 [328912][salt.master                              ][TRACE   ] Master function call _file_list took 2.2455427646636963 seconds
2021-12-30 19:30:26,412 [328910][salt.master                              ][TRACE   ] Master function call _file_list took 54.09862756729126 seconds
2021-12-30 19:30:26,911 [328914][salt.master                              ][TRACE   ] Master function call _file_list took 54.60709810256958 seconds
2021-12-30 19:30:27,207 [328902][salt.master                              ][TRACE   ] Master function call _file_list took 54.83136487007141 seconds
2021-12-30 19:30:27,221 [328913][salt.master                              ][TRACE   ] Master function call _file_list took 55.02363920211792 seconds
2021-12-30 19:30:35,928 [328912][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:35,929 [328912][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=8 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:35,976 [328914][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:35,976 [328914][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=8 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:36,057 [328913][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:36,057 [328913][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=9 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:36,343 [328910][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:36,343 [328910][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=9 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:37,603 [328912][salt.master                              ][TRACE   ] Master function call _file_list took 1.6741526126861572 seconds
2021-12-30 19:30:37,980 [328914][salt.master                              ][TRACE   ] Master function call _file_list took 2.0038652420043945 seconds
2021-12-30 19:30:37,984 [328912][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:37,985 [328912][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=10 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:38,011 [328913][salt.master                              ][TRACE   ] Master function call _file_list took 1.9535884857177734 seconds
2021-12-30 19:30:38,332 [328914][salt.master                              ][TRACE   ] AES payload received with command _file_list
2021-12-30 19:30:38,333 [328914][salt.fileserver                          ][DEBUG   ] Returning file list from cache: age=11 cache_time=20 /var/cache/salt/master/file_lists/roots/base.p
2021-12-30 19:30:38,408 [328910][salt.master                              ][TRACE   ] Master function call _file_list took 2.065243721008301 seconds
2021-12-30 19:30:39,246 [328912][salt.master                              ][TRACE   ] Master function call _file_list took 1.2613167762756348 seconds
2021-12-30 19:30:39,486 [328914][salt.master                              ][TRACE   ] Master function call _file_list took 1.1531543731689453 seconds
2021-12-30 19:30:49,726 [333562][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)
2021-12-30 19:31:51,151 [333562][salt.master                              ][DEBUG   ] Updating roots fileserver cache (master)

Expected behavior

The master should regenerate the file_list cache for a given backend only once per fileserver_list_cache_time window.

Screenshots

N/A

Versions Report

salt --versions-report

Master is running the master branch from git as of commit 306aa6dd29, with some local changes to add additional debug logging. Minions are running 3004 from the RPM repo.

```
Salt Version:
          Salt: 3003rc1+1373.g306aa6dd29

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: 4.0.9
     gitpython: 3.1.20
        Jinja2: 3.0.3
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.3
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: 3.12.0
        pygit2: Not Installed
        Python: 3.6.8 (default, Sep 9 2021, 07:49:02)
  python-gnupg: Not Installed
        PyYAML: 6.0
         PyZMQ: 22.3.0
         smmap: 5.0.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: rhel 8.5 Ootpa
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-305.el8.x86_64
        system: Linux
       version: Red Hat Enterprise Linux 8.5 Ootpa
```

Additional context

N/A

waynew commented 2 years ago

@duckfez thanks for the report! Definitely looks interesting. One thing I'm curious about - have you tried this same test without using gitfs? (maybe add way more files to the /srv/bigjunk or something)

I know that gitfs can have some... "surprising" side-effects. I wonder if this behavior is exacerbated by gitfs, or actually caused by it :thinking:

duckfez commented 2 years ago

@waynew yeah, I tried it both with gitfs+roots and just roots. And I think (not being an expert) that the race is there either way. The logs in my second test attempt above are with roots only (gitfs disabled).

My suspicion is that gitfs does exacerbate it, simply because it takes gitfs longer / more CPU cycles to produce the file_list cache. But with roots alone the underlying problem is still there. Like you said, I just had to add a BUNCH of files to /srv/bigjunk to get roots looking at enough files to really show the problem. Also, roots can probably take more advantage of the kernel filesystem cache than gitfs can.

I've thought about this (without looking closely at the code) and have been considering taking the write lock just before checking whether the cache needs to be refreshed, then releasing it once the refresh is done. What I think is happening right now is that all of the master workers thunder off in parallel to ask "is the cache too old? If so, I should lock it and rebuild." Only one gets the lock first, but by that point they have all committed to seeing the rebuild through ... so each one in sequence locks it and rebuilds it.
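For illustration, a minimal sketch of that check-then-act pattern, using a simplified in-process cache rather than Salt's actual fileserver code (the file path, timing, and helper names here are hypothetical):

```python
import os
import time
import threading

CACHE_FILE = "/tmp/file_list_cache.p"    # stand-in for base.p under /var/cache/salt
LIST_CACHE_TIME = 20                     # seconds, mirroring fileserver_list_cache_time
rebuild_lock = threading.Lock()

def cache_is_stale():
    try:
        return time.time() - os.path.getmtime(CACHE_FILE) > LIST_CACHE_TIME
    except OSError:
        return True                      # no cache file yet

def rebuild_cache():
    # Stand-in for the expensive walk over ~55K files.
    with open(CACHE_FILE, "w") as fh:
        fh.write("file list goes here\n")

def get_file_list_racy():
    # The staleness check happens *before* the lock is taken, so every worker
    # that arrives while the cache is stale commits to a rebuild; they then
    # serialize on the lock and rebuild one after another.
    if cache_is_stale():
        with rebuild_lock:
            rebuild_cache()
    with open(CACHE_FILE) as fh:
        return fh.read()
```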

duckfez commented 2 years ago

@waynew I did some experimenting here, and have made things both "better" and "worse" by adjusting the locking code. I found two "issues":

  1. The cache isn't locked when being checked to see if it needs to be rebuilt, so multiple workers do attempt to rebuild it in parallel
  2. If the lock is held for more than a few seconds, another worker will forcibly remove the lock and proceed with the rebuild itself

I have a "fix" for both of these, but the end result is that several workers all block waiting for the rebuild being done by the one that holds the lock. Master CPU utilization is better and minion throughput is a (little) better, but the master still becomes unresponsive because all of the workers get tied up.
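Continuing the simplified sketch from the earlier comment (same hypothetical helpers, not Salt's real locking code), the change amounts to moving the staleness check inside the lock:

```python
def get_file_list_locked():
    # Same cache_is_stale()/rebuild_cache()/rebuild_lock as in the sketch
    # above.  Because the check now happens while holding the lock, only the
    # first worker rebuilds and the rest find a fresh cache when they get the
    # lock -- but they all sit blocked on that lock while the single rebuild
    # runs, which is the worker tie-up described here.
    with rebuild_lock:
        if cache_is_stale():
            rebuild_cache()
    with open(CACHE_FILE) as fh:
        return fh.read()
```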

I have ideas for a different implementation, but it's a much bigger change and I would want to consult with you all more closely before just running off on some tangent.

Caine142 commented 1 year ago

@duckfez I am actually running into the same issue on my end for 3006.1 as well. Was there any solution to prevent the constant cache rebuilds?

duckfez commented 1 year ago

Hi @Caine142, I don't know of one. I changed jobs earlier this year, and in the new job we are not using saltstack at this time, so it's sorta fallen off my radar. One thing we did that helped a lot was using the --batch option to salt. Effectively, we'd batch at a size close to the number of cores on the salt master, so the master never got overloaded. CPU usage would still go to 100%, but it would not cause a lot of job queueing. This might be a possible workaround for you?

I had plans to write a SEP to substantially change the file list cache functionality in a couple of ways:

  1. Move the creation of the cache out into a Maintenance worker, rather than doing it in-line with calls to the file list operation
  2. Make the file list cache lock-free by doing some type of multi-version concurrency, probably using symlinks (roughly sketched below)

Both of these are large enough changes that they would need to go through a SEP, but I've not actually written it.
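For illustration only, a rough sketch of what the symlink-based idea in point 2 could look like: writers build a new cache file under a temporary name and atomically repoint a symlink, so readers never take a lock. The paths and names are hypothetical, not Salt's implementation, and cleanup of superseded cache files is omitted:

```python
import os
import tempfile

CACHE_DIR = "/var/cache/salt/master/file_lists/roots"    # directory seen in the logs
CURRENT_LINK = os.path.join(CACHE_DIR, "base.current")   # hypothetical symlink name

def publish_file_list(serialized_list: bytes) -> None:
    # Write the new list to a unique temporary file, then atomically swap the
    # symlink over it.  Readers following base.current always see a complete
    # file and never wait on a writer.
    fd, tmp_path = tempfile.mkstemp(dir=CACHE_DIR, suffix=".p")
    with os.fdopen(fd, "wb") as fh:
        fh.write(serialized_list)
    tmp_link = tmp_path + ".lnk"
    os.symlink(tmp_path, tmp_link)
    os.rename(tmp_link, CURRENT_LINK)    # atomic replace on POSIX filesystems

def read_file_list() -> bytes:
    with open(CURRENT_LINK, "rb") as fh:
        return fh.read()
```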