saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

[BUG] Maintenance process leaking memory in 3005.1 #63747

Closed denschub closed 1 year ago

denschub commented 1 year ago

Description

This appears to be a regression in 3005.1, as I didn't see this before. The "maintenance" process grows relatively quickly, until it eventually gets OOM'ed.


Steps to Reproduce the behavior

Nothing special. It's just a master running. Nothing of note in the logs either.

Expected behavior

n/a

Screenshots

Screenshot 2023-02-21 at 03 06 21

(green is memory in use, yellow is disk caches, rest is.. rest.)

Versions Report

salt --versions-report

```yaml
Salt Version:
  Salt: 3005.1

Dependency Versions:
  cffi: 1.15.1
  cherrypy: Not Installed
  dateutil: Not Installed
  docker-py: Not Installed
  gitdb: 4.0.10
  gitpython: 3.1.29
  Jinja2: 3.1.2
  libgit2: 1.5.0
  M2Crypto: 0.38.0
  Mako: Not Installed
  msgpack: 1.0.4
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.12.0
  pygit2: 1.11.1
  Python: 3.10.9 (main, Dec 19 2022, 17:35:49) [GCC 12.2.0]
  python-gnupg: Not Installed
  PyYAML: 6.0
  PyZMQ: 24.0.1
  smmap: 5.0.0
  timelib: Not Installed
  Tornado: 4.5.3
  ZMQ: 4.3.4

System Versions:
  dist: arch
  locale: utf-8
  machine: x86_64
  release: 6.1.12-1-lts
  system: Linux
  version: Arch Linux
```

Additional context

This appears to be a regression in 3005.1, but I can't 100% verify this right now :/

denschub commented 1 year ago

Might be related to #58791, but that one is so old, I felt like it's better to file a new one.

OrangeDog commented 1 year ago

More likely related to #62706.

whytewolf commented 1 year ago

I have not been able to replicate this. What settings are setup in the master?

whytewolf commented 1 year ago

here is a list of items in the maintenance process. that might help with knowing which settings to share.

  1. aes key rotation. rotates the aes encryption every 24 hours
  2. key_cache evaluation. creates a msgpack file with a list of accepted keys
  3. git_pillar fetch. run a fetch remotes on git_pillar systems
  4. scheduler, runs the master scheduler.
  5. presence, fires presence events. if presence events are setup.

This is done on loop_interval.
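To make the list above concrete, here is a minimal sketch of a loop that wakes up once per interval and runs whichever tasks are due. It is a simplified model for illustration only, not Salt's actual implementation; the `Task` class, `maintenance_pass`, `run`, and the intervals used in the usage note are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One periodic job, e.g. a git_pillar fetch (hypothetical model)."""
    name: str
    interval: float                      # seconds between two runs
    action: Callable[[], None]
    last_run: float = float("-inf")      # never run yet


def maintenance_pass(tasks: List[Task], now: float) -> List[str]:
    """Run every task whose interval has elapsed; return the names that ran."""
    ran = []
    for task in tasks:
        if now - task.last_run >= task.interval:
            task.action()
            task.last_run = now
            ran.append(task.name)
    return ran


def run(tasks: List[Task], loop_interval: float, passes: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep) -> List[List[str]]:
    """Do `passes` maintenance passes, sleeping loop_interval between them."""
    history = []
    for _ in range(passes):
        history.append(maintenance_pass(tasks, clock()))
        sleep(loop_interval)
    return history
```

For example, with a 60-second fetch task and an 86400-second key rotation task, a pass at `now=60` runs only the fetch. The point for this issue: anything a task leaks per run accumulates at the rate of its interval, so a leak in the most frequent task shows up as steady growth.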

denschub commented 1 year ago

So, there's a couple of things. I'm using ext_pillar, where one is a local git repo, and another one is actually a cmd_json. I also have three gitfs_remotes. It's hard to disable any of them because that effectively breaks everything -- but if you want me to disable something to check if the memory issue still happens, I can make that happen.

Full master config for reference

```yaml
default_include: master.d/*.conf
interface: '::'
ipv6: True
state_output: mixed

file_roots:
  base:
    - /srv/salt

fileserver_backend:
  - git
  - roots

gitfs_pubkey: '/root/.ssh/id_ed25519.pub'
gitfs_privkey: '/root/.ssh/id_ed25519'
gitfs_remotes:
  - file:///var/lib/gitolite/repositories/states.git
  - https://github.com/saltstack-formulas/openssh-formula.git
  - https://github.com/saltstack-formulas/sudoers-formula.git

ext_pillar:
  - git:
    - master file:///var/lib/gitolite/repositories/pillar.git
  - cmd_json: /etc/letsencrypt/tools/salty-letsencrypt-pillar %s

git_pillar_pubkey: '/root/.ssh/id_ed25519.pub'
git_pillar_privkey: '/root/.ssh/id_ed25519'

# /etc/salt/master.d/reactor.conf
reactor:
  - 'salt/fileserver/gitfs/update':
    - /srv/reactor/update_fileserver.sls
```

whytewolf commented 1 year ago

Thank you. I'll see what i can do with these. the only one that should matter for the maintenance process is the git_pillar. all the others are handled through other processes.

speaking of which. how large is the local repo for your git_pillar?

for clarity, how many minions are connected to the master? a breakdown by acceptance status would be best.

whytewolf commented 1 year ago

humm. one thing i am noticing. you are not using ssh for any of your git remotes so you might be able to change library without much issue.

I noticed you setup pubkey and privkey even though you're not using ssh with gitfs or git_pillar. you can remove those settings as those are only for ssh based git, and are ignored otherwise.

since you are not actually using the more advanced git authentication methods currently, can you try switching the git library you use? you might need to install the other library for this, and set the config to use it. if you're using pygit2 switch to GitPython or vice versa.

if the problem remains it most likely would be something in git_pillar. if it doesn't it is the library in use. either way please update us.
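Before switching libraries, it helps to confirm that the other one is importable by the Python interpreter Salt runs under. The sketch below only checks importability; the helper name `available_providers` is hypothetical. The library itself is selected in the master config through the `gitfs_provider` and `git_pillar_provider` options, with `pygit2` or `gitpython` as the value.

```python
import importlib.util

# Salt provider name -> Python module that must be importable for it.
# GitPython installs itself under the module name "git".
PROVIDER_MODULES = {"pygit2": "pygit2", "gitpython": "git"}


def available_providers(modules=PROVIDER_MODULES):
    """Return {provider_name: True/False} depending on importability."""
    return {
        provider: importlib.util.find_spec(module) is not None
        for provider, module in modules.items()
    }


if __name__ == "__main__":
    for provider, present in available_providers().items():
        print(f"{provider}: {'installed' if present else 'missing'}")
```

Run it with the same interpreter the master uses, since a system Python and a bundled Salt Python can see different packages.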

denschub commented 1 year ago

Thanks for your help so far. Much appreciated!

> how large is the local repo for your git_pillar?

A clone of the entire repo is ~550KiB. 18 files, 530'ish lines.

> how many minions are connected to the master? a breakdown by acceptance status would be best.

18 minions, all accepted. No denied/unaccepted/rejected.

> I noticed you setup pubkey and privkey even though you're not using ssh with gitfs or git_pillar.

That is true. I used to use ssh for gitfs, but had an issue with that a while ago, and switched to just pointing it to the local directory. However, that issue is no longer valid as far as I know. I have just pointed the gitfs back to ssh://, and will report if that resolves the issue.

If it doesn't, I'll switch from pygit2 to GitPython and report back the results of that!

whytewolf commented 1 year ago

humm. definitely shouldn't be using that much memory. and to test i setup a local git server to run a small pillar through and then set the git_pillar update interval to 2 seconds. and i don't seem to be having any kind of increase in mem. :/ that worries me more, as that might mean it is something else. are you using any jinja in those pillars? doing any file importing?

denschub commented 1 year ago

No Jinja, no file importing. It's all just pretty boring YAML. :/ Looking at my memory graph just now, it's clear that switching back to ssh didn't work:

Screenshot 2023-03-21 at 23 28 11

I flipped back my git_pillar to file:// and switched to gitpython. Will report back in 12 hours or so!

denschub commented 1 year ago

Okay, I'm flabbergasted. I switched to gitpython yesterday, and memory usage has been stable. To verify, I switched back to pygit2 at 16:00 UTC, and sure enough, it's immediately back at eating memory:

Screenshot 2023-03-22 at 19 45 44

So the issue is either in libgit2, pygit2, or in the Salt code calling it. :/ There currently is an update to libgit2 1.6.3 in Arch Linux's testing repository, which I'll update to as soon as I can. At this moment, however, libgit2 1.5.1 and pygit2 1.11.1 are the versions this reproduces on, and the newest I have available.
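One way to narrow this down is to take Salt out of the picture: call pygit2 directly in a loop and watch the process's resident memory. The following is a rough sketch under several assumptions: it is Linux-only (it reads `/proc`), the helper names are hypothetical, and Salt's own fetch path may use pygit2 differently (for instance by reusing repository objects), so a clean result here would not fully clear the library.

```python
def parse_vmrss(status_text):
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")


def rss_kib(pid="self"):
    """Resident set size of a process in KiB (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss(handle.read())


def measure_growth(action, iterations, read_rss=rss_kib):
    """Run `action` repeatedly and return the RSS change in KiB."""
    before = read_rss()
    for _ in range(iterations):
        action()
    return read_rss() - before


def fetch_once(repo_path, remote_name="origin"):
    """Open the repository and fetch one remote using pygit2 directly."""
    import pygit2  # imported lazily so the helpers above work without it

    repo = pygit2.Repository(repo_path)
    repo.remotes[remote_name].fetch()
```

A call such as `measure_growth(lambda: fetch_once("/path/to/clone"), 1000)` (the path is a placeholder) that keeps growing as the iteration count is raised would point at pygit2 or libgit2. Flat memory would shift suspicion toward the way Salt calls them.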

whytewolf commented 1 year ago

interesting. my own testing system is running salt 3006rc2 with libgit2 1.5.0 and pygit2 1.11.1 and I can't replicate it.

looking at the changelog for libgit2 there were a couple of mem leak fixes in 1.6.1.

but that makes me wonder how am i not seeing it. unless the mem leak was introduced in 1.5.1

denschub commented 1 year ago

My initial bug report was filed with libgit2 1.5.0, so it's not a 1.5.1 regression I'm afraid. Maybe 3006 fixed it by accident? That's unlikely, but heh. Given that I can work around this just fine for myself by switching to gitpython, I'm happy to wait until the libgit2 1.6.1 update is available for me to test. I'd love to be able to provide more useful information, but I'm sadly not enough into Python to know an approach to get useful memory traces...
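For anyone looking for a starting point on memory traces, the standard library's `tracemalloc` module can compare two snapshots and list the source lines whose allocations grew. This is a minimal sketch with a hypothetical helper name, `top_growth`. One caveat matters for this issue: `tracemalloc` only sees memory allocated through Python's allocators, so a leak inside libgit2's C code would not appear here even while the process RSS keeps rising.

```python
import tracemalloc


def top_growth(action, iterations=10, limit=5):
    """Return the source lines whose allocations grew most across runs."""
    tracemalloc.start(25)          # keep up to 25 frames per allocation
    action()                       # warm-up, so one-time caches are excluded
    before = tracemalloc.take_snapshot()
    for _ in range(iterations):
        action()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # StatisticDiff entries, largest absolute size change first
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    leaked = []
    for stat in top_growth(lambda: leaked.append(bytearray(100_000))):
        print(stat)
```

If the top entries show no meaningful growth while the process still gets bigger, that is a hint the leak sits in native code rather than in Python objects.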

whytewolf commented 1 year ago

:/ humm I don't know then. I doubt 3006 changed anything that would fix it. the git_pillar code hasn't changed in almost 2 years and the only change to the utils.gitfs code which git_pillar uses was about version information. everything else is much older than the 2 years. there has to be another variable in play that we are overlooking.

The fact it is happening in the maintenance thread means it is localized to the git fetch.

cmcmarrow commented 1 year ago

@denschub I believe #64072 should fix your leak. If you find that you still see the leak please reopen the ticket. Thank you for your patience.