Zombies created by workers / wedged worker threads

During some Lustre I/O issues on cedar, we ran into the following zombie process:

chimedat 37016  1.3  0.0 1094268 111908 pts/13 Ssl+ Oct23  38:44  \_ /home/chimedat/alpenvenv/bin/python /home/chimedat/alpenvenv/bin/alpenhornd
chimedat 37305  0.0  0.0 212200 12292 ?        Ssl  Oct23   0:02      \_ orted --hnp --set-sid --report-uri 8 --singleton-died-pipe 9 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated
chimedat 39198  0.0  0.0      0     0 pts/13   Z+   Oct24   0:00      \_ [bbcp] <defunct>

which was not being reaped by its worker, meaning the worker never finished its task. This, in turn, blocked further transfers into cedar_staging because the wedged worker kept the node busy.

We probably should have a mechanism for the main process to detect workers that have become wedged and deal with them.
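One possible shape for such a mechanism is a heartbeat watchdog. This is a minimal sketch, not alpenhorn's actual code: `WorkerWatchdog`, `beat` and `wedged` are hypothetical names, and the timeout is an assumption that would need tuning to the longest legitimate transfer. Each worker records a timestamp when it starts a task and whenever it makes progress, and the main loop asks which workers have gone quiet for too long.

```python
import threading
import time


class WorkerWatchdog:
    """Track when each worker last reported progress (hypothetical helper)."""

    def __init__(self, timeout):
        # Seconds of silence after which a worker is considered wedged.
        self.timeout = timeout
        self._last_beat = {}
        self._lock = threading.Lock()

    def beat(self, worker_name, now=None):
        """Called by a worker when it starts a task or makes progress."""
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._last_beat[worker_name] = stamp

    def wedged(self, now=None):
        """Return the sorted names of workers silent for longer than timeout."""
        now = time.monotonic() if now is None else now
        with self._lock:
            return sorted(
                name
                for name, last in self._last_beat.items()
                if now - last > self.timeout
            )
```

The main loop could call `wedged()` periodically and log the result or stop assigning work to the affected node. Python has no supported way to forcibly kill a thread, so "dealing with" a wedged worker would probably mean putting a timeout on the child process wait inside the worker, or restarting the daemon.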

Note: it's not clear to me what state the worker was in or why it wasn't able to reap the child. Apart from this one stuck worker, the rest of alpenhornd was working normally.
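To find out what state a stuck worker is in next time, the daemon could dump the Python stack of every thread on demand. The following is a sketch under the assumption that the workers are ordinary `threading` threads; `format_thread_stacks` is a hypothetical helper, and how it would be triggered (a signal handler, a log message from the watchdog, etc.) is left open.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current Python stack of every live thread."""
    # Map thread identifiers to their human-readable names.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() maps each thread id to its topmost frame.
    for ident, frame in sys._current_frames().items():
        chunks.append(f"Thread {names.get(ident, 'unknown')} ({ident}):\n")
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)
```

The standard library's `faulthandler` module offers a similar dump that can be tied to a signal on Unix, which may be simpler if no custom formatting is needed.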
