radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

Zombies created by workers / wedged worker threads #175

Open ketiltrout opened 10 months ago

ketiltrout commented 10 months ago

During some Lustre I/O issues on cedar, we ran into the following zombie process:

chimedat 37016  1.3  0.0 1094268 111908 pts/13 Ssl+ Oct23  38:44  \_ /home/chimedat/alpenvenv/bin/python /home/chimedat/alpenvenv/bin/alpenhornd
chimedat 37305  0.0  0.0 212200 12292 ?        Ssl  Oct23   0:02      \_ orted --hnp --set-sid --report-uri 8 --singleton-died-pipe 9 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated
chimedat 39198  0.0  0.0      0     0 pts/13   Z+   Oct24   0:00      \_ [bbcp] <defunct>

which was not being reaped by its worker, meaning the worker never finished its task. This, in turn, blocked further transfers into cedar_staging because the wedged worker was keeping the node busy.
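
For illustration, one general way to keep a hung child from becoming an unreaped zombie is to bound the wait and always call `wait()` again after killing it. This is only a sketch under assumptions: `run_with_timeout` is a hypothetical helper, not alpenhorn's actual transfer code, and it would not help if the worker was blocked somewhere other than waiting on the child.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd, killing and reaping it if it exceeds timeout_s seconds.

    Hypothetical helper for illustration; returns the child's exit status.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal path: wait() reaps the child once it exits.
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child is still running: kill it, then wait again so the
        # kernel can release its process-table entry (no zombie left).
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a stalled transfer command such as bbcp.
    status = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5
    )
    print("child exit status:", status)
```

The second `wait()` after `kill()` is the important part: killing a child without waiting on it still leaves a `<defunct>` entry.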

We probably should have a mechanism for the main process to detect workers that have become wedged and deal with them.
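
A minimal sketch of what such a mechanism might look like, assuming workers can report progress to the main process. `WorkerWatchdog`, `heartbeat`, `finished` and `wedged` are hypothetical names, not part of alpenhorn, and what "deal with them" means (log, alert, release the node, restart) is left open here.

```python
import threading
import time


class WorkerWatchdog:
    """Track when each worker last reported progress (hypothetical helper)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}  # worker name -> time of last heartbeat
        self._lock = threading.Lock()

    def heartbeat(self, worker_name, now=None):
        """Called by a worker when it starts a task or makes progress."""
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._last_seen[worker_name] = stamp

    def finished(self, worker_name):
        """Called by a worker when its task completes; stops tracking it."""
        with self._lock:
            self._last_seen.pop(worker_name, None)

    def wedged(self, now=None):
        """Return names of workers silent for longer than timeout_s."""
        now = time.monotonic() if now is None else now
        with self._lock:
            return sorted(
                name
                for name, seen in self._last_seen.items()
                if now - seen > self.timeout_s
            )
```

The main loop could call `wedged()` periodically and act on any names it returns. The `now` argument exists only so the check can be exercised without real waiting.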

Note: it's not clear to me what state the worker was in or why it wasn't able to reap the child. Apart from this one stuck worker, the rest of alpenhornd was working normally.

ketiltrout commented 10 months ago

There may be two somewhat separate ideas here, viz.:

The solution to these might be two separate things.