mej / nhc

LBNL Node Health Check
Other
213 stars 78 forks source link

When using check_cmd_status, nhc leaks watchdog processes #104

Closed ghost closed 11 months ago

ghost commented 3 years ago

A simple test case:

nhc.conf:

* || TIMEOUT=120
* || check_cmd_status -t 100 -r 0 /bin/true

Now if nhc is run on a node, the main process returns quickly, but it leaks a background nhc watchdog process which still has a TTY and is running sleep. After the sleep expires, it will trigger the watchdog and it will try to kill a process. In theory, if the machine is sufficiently busy or the timeout is sufficiently long, this can end up killing a process using a reused PID number.

from ps axf right after running nhc:

240166 pts/0    S      0:00 /bin/bash /usr/sbin/nhc
240167 pts/0    S      0:00  \_ sleep 100

This is a bit of an annoyance because it also causes this (because of it holding on to a TTY):

$ time ssh <node> "time nhc"

real    0m0.037s
user    0m0.023s
sys     0m0.005s

real    1m40.472s
user    0m0.026s
sys     0m0.005s
$

The watchdog process is spawned at https://github.com/mej/nhc/blob/master/scripts/lbnl_cmd.nhc#L169, and as far as I can see, there is nothing that tries to kill it after the command runs, and there probably should be.

In case bash version is relevant, this happens on version 4.2.46(2)-release (x86_64-redhat-linux-gnu).

basvandervlies commented 1 year ago

Is there a solution? We have the same problem I did not notice this till I installed slurm 22.05 and now the srun hangs till this process are finished. In older slurm versions it was no problem

mej commented 1 year ago

We have not seen this on 22.05, to the best of my knowledge, but I will look again. But for what it's worth, going over the code again, I believe I can see exactly what the problem is. In fact, I'm not sure if I overlooked something or just wasn't paying close enough attention, but I don't really know how I expected it to work correctly at all, what with everything being in local variables...

Anyhoo...I will prioritize this fix. The gist of it is that nhcmain_watchdog_timer() needs to use global arrays, not local variables, for keeping track of the NHC/subcommand PIDs and the sleep PIDs, and the trap calls need to be set for all the relevant PIDs.

mej commented 1 year ago

I would love to get your feedback, either on the b08769bb workaround/fix on the 1.4.4-dev branch or the new 1.5 code on the PR #132 branch! I believe either will fix this problem.

mej commented 11 months ago

In the absence of further information to the contrary, I believe this to be fixed, so I am closing this Issue. Please feel free to reopen if you're still seeing this behavior!