python-diamond / Diamond

Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others). It is capable of collecting cpu, memory, network, i/o, load and disk metrics. Additionally, it features an API for implementing custom collectors for gathering metrics from almost any source.
http://diamond.readthedocs.org/
MIT License
1.74k stars 601 forks source link

Diamond leaving around zombie children of collector processes #674

Open bakerkj opened 6 years ago

bakerkj commented 6 years ago

I see child processes of diamond collectors occasionally become and stay zombies. The processes that have become zombies are (ping and unbound-control). I am also seeing messages in the log "Took too long to run! Killed!" in the logs.

From looking at the code (I am running something very close to master as of May 16th with a few small modifications to some of the collectors not related to this problem). These external programs are invoked by the UnboundCollector and PingCollector which both use the run_command() found in collector.py. run_command() uses subprocess.Popen(command, stdout=subprocess.PIPE).communicate(). The collect loop is run via utils/scheduler.py:collector_process(). This also controls the timeout behavior with signal alarm and the timeout SIGALRMException prints the log message "Took too long to run! Killed!".

I believe what is happening is that the signal.alarm is triggered because these processes (ping and unbound-control) are sometimes taking longer than the allowed timeout to run. This alarm interrupts the wait() call inside of subprocess.communicate(). The python collector thread carries on to try again at some future time. The ping/unbound-control commands eventually die, but since the wait() call was interrupted their parent (the collector thread) never comes back to check their exit code and those processes are left around as Zombies.

I believe the correct thing to do would be for the collector thread (maybe in the run_command() method), in the case of an alarm, would be to abruptly kill and then wait on the child process before going on to do other things.

Has anyone else seen this issue?

Does my proposed solution sound reasonable?

Thanks!