it-novum / openITCOCKPIT

openITCOCKPIT is an Open Source system monitoring tool built for different monitoring engines like Nagios, Naemon and Prometheus.
https://openitcockpit.io/
GNU General Public License v3.0

Make important services resilient against oom-killer #1686

Open kbilev opened 2 months ago

kbilev commented 2 months ago

Is your feature request related to a problem? Please describe.

We have one satellite dedicated to check_nwc_health checks for a specific kind of device. If those devices become unreachable for some reason, such as a network outage, the checks time out and the check queue fills up until the host reaches maximum CPU usage or runs out of memory. The oom-killer then tries to free memory (normal behaviour), but will probably kill the "wrong" processes, such as mysqld or gearmand. We are then stuck in a loop, and the satellite cannot return to a normal state. A hard reboot of the satellite does not solve the problem, because the check queue fills up too fast and the oom-killer kills the wrong processes again.

Describe the solution you'd like

It might be a good idea to make the system-relevant processes more resilient against the oom-killer by adding a parameter to their unit files:

[Service]
OOMScoreAdjust=-1000

We have not tested this setting yet, so we cannot say whether it really resolves the problem.
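One way to try the setting without editing the packaged unit files is a systemd drop-in. The following is only a minimal sketch: the helper name `write_oom_dropin`, the file name `oom.conf`, and the unit names in the usage note are assumptions and have to be adapted to the actual satellite. Note that a value of -1000 exempts the process from the oom-killer entirely, so a less extreme value may be the safer first test.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets OOMScoreAdjust
# for one unit. Arguments:
#   $1  unit name (e.g. "mysql.service"; actual name is an assumption)
#   $2  score, defaults to -1000 (process is never picked by the oom-killer)
#   $3  base directory, defaults to /etc/systemd/system
write_oom_dropin() {
    unit="$1"
    score="${2:--1000}"
    base="${3:-/etc/systemd/system}"
    dir="$base/$unit.d"

    # Drop-ins live in "<unit>.d/" next to the unit file.
    mkdir -p "$dir" || return 1
    printf '[Service]\nOOMScoreAdjust=%s\n' "$score" > "$dir/oom.conf"
}
```

Possible usage as root, assuming the units are called mysql.service and gearman-job-server.service: run `write_oom_dropin mysql.service` and `write_oom_dropin gearman-job-server.service`, then `systemctl daemon-reload` and restart the services. The effective value can be checked with `cat /proc/<pid>/oom_score_adj` for the running process.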

Describe alternatives you've considered

To get the satellite back online, you have to kill check processes until the system is able to process all of them again:

ps -ef | grep 'perl' | grep -v grep | awk '{print $2}' | head -n 300 | xargs -r kill -9