Is there a solution? We have the same problem. I did not notice it until I installed Slurm 22.05; now srun hangs until these processes have finished. With older Slurm versions this was not a problem.
We have not seen this on 22.05, to the best of my knowledge, but I will look again. But for what it's worth, going over the code again, I believe I can see exactly what the problem is. In fact, I'm not sure if I overlooked something or just wasn't paying close enough attention, but I don't really know how I expected it to work correctly at all, what with everything being in local variables...
Anyhoo...I will prioritize this fix. The gist of it is that nhcmain_watchdog_timer() needs to use global arrays, not local variables, for keeping track of the NHC/subcommand PIDs and the sleep PIDs, and the trap calls need to be set for all the relevant PIDs.
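A minimal sketch of the approach described above, assuming a simplified watchdog rather than NHC's real nhcmain_watchdog_timer(). The names WATCHDOG_PIDS, start_watchdog, cancel_watchdogs, and run_with_watchdog are hypothetical. It keeps the watchdog PIDs in a global array so that cleanup code can reach them, and gives each watchdog a TERM trap that also kills its own sleep.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; every name here is hypothetical, not NHC's API.

# Global state, so cleanup code and traps can reach every PID we started.
declare -a WATCHDOG_PIDS=()

# start_watchdog TIMEOUT PID: kill PID after TIMEOUT seconds unless cancelled.
start_watchdog() {
    local timeout=$1 target=$2
    (
        sleep_pid=""
        # On TERM, take the sleep down too so nothing lingers on the TTY.
        trap '[[ -n $sleep_pid ]] && kill "$sleep_pid" 2>/dev/null; exit 0' TERM
        sleep "$timeout" &
        sleep_pid=$!
        # wait returns early if TERM arrives, and the trap then runs.
        wait "$sleep_pid" && kill "$target" 2>/dev/null
    ) &
    WATCHDOG_PIDS+=("$!")
}

# Cancel every watchdog that is still tracked; safe to call repeatedly.
cancel_watchdogs() {
    local pid
    for pid in "${WATCHDOG_PIDS[@]}"; do
        kill "$pid" 2>/dev/null    # triggers the TERM trap above
        wait "$pid" 2>/dev/null    # reap the watchdog subshell
    done
    WATCHDOG_PIDS=()
}

# run_with_watchdog TIMEOUT CMD [ARGS...]: run CMD, killing it on timeout.
run_with_watchdog() {
    local timeout=$1 rc
    shift
    "$@" &
    local cmd_pid=$!
    start_watchdog "$timeout" "$cmd_pid"
    wait "$cmd_pid"
    rc=$?
    cancel_watchdogs               # the step a leaking watchdog never gets
    return "$rc"
}

# Also clean up if the script exits while watchdogs are still running.
trap cancel_watchdogs EXIT

run_with_watchdog 30 sleep 0.2
echo "exit status: $?, watchdogs still tracked: ${#WATCHDOG_PIDS[@]}"
```

With this layout the call returns as soon as the command finishes, because neither the watchdog subshell nor its sleep is left behind. There is still a small window between starting sleep and recording its PID in which a TERM could leave the sleep running; the actual fix may well handle this differently.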
I would love to get your feedback, either on the b08769bb workaround/fix on the 1.4.4-dev branch or the new 1.5 code on the PR #132 branch! I believe either will fix this problem.
In the absence of further information to the contrary, I believe this to be fixed, so I am closing this Issue. Please feel free to reopen if you're still seeing this behavior!
A simple test case:
nhc.conf:
Now if nhc is run on a node, the main process returns quickly, but it leaks a background nhc watchdog process which still has a TTY and is running sleep. After the sleep expires, it will trigger the watchdog and it will try to kill a process. In theory, if the machine is sufficiently busy or the timeout is sufficiently long, this can end up killing a process using a reused PID number.
From ps axf right after running nhc:
This is a bit of an annoyance because it also causes this (because of it holding on to a TTY):
The watchdog process is spawned at https://github.com/mej/nhc/blob/master/scripts/lbnl_cmd.nhc#L169, and as far as I can see, there is nothing that tries to kill it after the command runs, and there probably should be.
In case the bash version is relevant, this happens on version 4.2.46(2)-release (x86_64-redhat-linux-gnu).
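The leak can be approximated without NHC. The sketch below is a hypothetical reduction, not the code in lbnl_cmd.nhc: leaky_run is an invented name, and the watchdog is assumed to be a background subshell that sleeps and then kills the command's PID. It shows the function returning right away, while anything that reads its output until end-of-file (command substitution here) stays blocked until the sleep expires.

```bash
#!/usr/bin/env bash
# Hypothetical reproduction; leaky_run is an invented name, not NHC code.

# leaky_run TIMEOUT CMD [ARGS...]: run CMD with a watchdog that is never cancelled.
leaky_run() {
    local timeout=$1
    shift
    "$@" &
    local cmd_pid=$!
    # The watchdog keeps running after CMD exits and still holds our stdout.
    # When it fires, cmd_pid may be gone or may belong to another process.
    ( sleep "$timeout"; kill "$cmd_pid" 2>/dev/null ) &
    wait "$cmd_pid"
}

# Direct call: returns as soon as the command is done.
SECONDS=0
leaky_run 2 true
echo "function returned after ${SECONDS}s"

# Captured call: the reader waits until the leaked watchdog closes the pipe.
SECONDS=0
out=$(leaky_run 2 echo done)
echo "captured '$out' after ${SECONDS}s"
```

The second call is the interesting one: the command finishes immediately, yet the capture only completes once the orphaned watchdog and its sleep have closed their inherited copy of the output pipe. That may be the same mechanism behind callers that appear to hang, though this is an inference rather than something the report confirms.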