saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

maintenance process restarted #53548

Open tsaridas opened 5 years ago

tsaridas commented 5 years ago

Description of Issue

2019-06-18 08:30:32,832 [salt.utils.process:754 ][ERROR   ][116567] An un-handled exception from the multiprocessing process 'Maintenance-9' was caught:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/utils/process.py", line 747, in _run
    return self._original_run()
  File "/usr/lib/python2.7/site-packages/salt/master.py", line 223, in run
    salt.daemons.masterapi.clean_old_jobs(self.opts)
  File "/usr/lib/python2.7/site-packages/salt/daemons/masterapi.py", line 174, in clean_old_jobs
    mminion.returners[fstr]()
  File "/usr/lib/python2.7/site-packages/salt/returners/local_cache.py", line 441, in clean_old_jobs
    shutil.rmtree(t_path)
  File "/usr/lib64/python2.7/shutil.py", line 247, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/var/cache/salt/master/jobs/94/81c947980cd522d331b9bd44d1f2a165599d1363953bc6e80ad80f0d20bb4a'
2019-06-18 08:30:37,003 [salt.utils.process:435 ][INFO    ][122058] Process <class 'salt.master.Maintenance'> (116567) died with exit status 1, restarting...

Versions Report

Should affect all versions.

Akm0d commented 5 years ago

Thanks for reporting the issue! Can you give us the output of salt-call --local test.versions from the master and describe the module/state/command that was being invoked when this error was encountered?

tsaridas commented 5 years ago

I wrote that all versions should be affected. The maintenance process doesn't have anything to do with commands being invoked afaik.

rmatte commented 3 years ago

We're seeing this on 3002.6 (and it is likely not fixed in 3003 either).

This traceback shows up often in the log on one of our syndics.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/utils/process.py", line 895, in wrapped_run_func
    return run_func()
  File "/usr/lib/python3/dist-packages/salt/master.py", line 249, in run
    salt.daemons.masterapi.clean_old_jobs(self.opts)
  File "/usr/lib/python3/dist-packages/salt/daemons/masterapi.py", line 165, in clean_old_jobs
    mminion.returners[fstr]()
  File "/usr/lib/python3/dist-packages/salt/returners/local_cache.py", line 445, in clean_old_jobs
    shutil.rmtree(f_path)
  File "/usr/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/var/cache/salt/master/jobs/6b/8bf745498c8a3ec534403e52c07d35059d4aeb3238386446f2ed746691e73b'

An important thing to note is that after we see this in the log we end up with a defunct process like this:

root     1288688  0.0  0.0      0     0 ?        Z    May20  47:59 [salt-master] <defunct>

Once that happens, our Salt job cache starts growing without bound. I have to restart the salt-master service manually to force it to start cleaning up after itself again. If I don't, we eventually run out of inodes and/or disk space on whichever syndic is affected. We have 3 syndics, and all of them hit this same problem at different intervals.

We're currently considering patching this code ourselves to avoid this condition, since this is killing us right now. The cleanup should probably do something more aggressive than a bare rmtree there, or at least take extra steps to clear out the directory if anything is still in it, rather than bombing out with an exception like that.
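
One way to read "more aggressive than just an rmtree" is to retry the removal when the directory turns out not to be empty, and to report failure instead of raising. The following is only a minimal sketch under that assumption: `remove_tree_with_retry`, `attempts`, and `delay` are hypothetical names, not anything in Salt, and it assumes the "Directory not empty" error comes from files appearing in the job directory while the removal is in progress.

```python
import errno
import os
import shutil
import time


def remove_tree_with_retry(path, attempts=3, delay=0.1):
    """Hypothetical helper: retry shutil.rmtree when the tree is still being filled.

    Returns True if the directory is gone afterwards, False otherwise.
    ENOTEMPTY and ENOENT are treated as retryable; any other OSError
    (for example a permission problem) is re-raised.
    """
    for _ in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except OSError as exc:
            if exc.errno not in (errno.ENOTEMPTY, errno.ENOENT):
                raise
            if not os.path.exists(path):
                # Someone else removed it in the meantime; nothing left to do.
                return True
            # Give the concurrent writer a moment, then try again.
            time.sleep(delay)
    return not os.path.exists(path)
```

A caller in a periodic cleanup loop could log a warning when this returns False and move on to the next job directory, leaving the leftover for the next pass instead of letting the exception end the process.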

That defunct process has another salt-master process as its parent, so I suspect it is the process responsible for cleaning the cache: it gets stuck in a bad state, and because the parent neither detects the problem nor self-remediates, the parent has to be restarted to recover. This seems like a fairly serious bug that needs to be addressed.

tsaridas commented 3 years ago

I patched the master myself to avoid this error. It should be easy: just add the ignore_errors option to the rmtree call.
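
For illustration, the kind of change described here might look roughly like the sketch below. It is not the actual patch: `clean_job_dir` is a hypothetical stand-in for the cleanup call in the returner, and only `shutil.rmtree(..., ignore_errors=True)` is a real standard-library call.

```python
import os
import shutil


def clean_job_dir(path):
    """Hypothetical stand-in for the job-cache cleanup call.

    With ignore_errors=True, rmtree swallows failures such as
    "Directory not empty" instead of raising, so the caller survives.
    Returns True if the directory is gone afterwards.
    """
    shutil.rmtree(path, ignore_errors=True)
    return not os.path.exists(path)
```

The trade-off is that ignore_errors=True hides every failure, including permission errors, so a directory that can never be removed would go unnoticed unless the return value is checked or logged.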