During some Lustre I/O issues on cedar, we ran into the following zombie process:
which was not being reaped by its worker, meaning the worker never finished its task. This, in turn, prevented further transfers into cedar_staging because the wedged worker was keeping the node busy.

We probably should have a mechanism for the main process to detect workers that have become wedged and deal with them.
Note: it's not clear to me what state the worker was in or why it wasn't able to reap the child. Apart from this one stuck worker, the rest of alpenhornd was working normally.
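
To make the suggestion about detecting wedged workers a bit more concrete, here is a rough sketch of one possible approach: each worker records when it starts and finishes a task, and the main process periodically asks which workers have been on the same task for longer than some timeout. This is not based on how alpenhornd actually manages its workers; the class `WorkerWatchdog`, its method names, and the timeout value are all hypothetical.

```python
import threading
import time


class WorkerWatchdog:
    """Hypothetical helper: track how long each worker has been on its current task."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._started = {}  # worker name -> monotonic time the current task began
        self._lock = threading.Lock()

    def task_started(self, worker):
        # Called by a worker just before it begins a task.
        with self._lock:
            self._started[worker] = time.monotonic()

    def task_finished(self, worker):
        # Called by a worker once its task (including reaping any child) is done.
        with self._lock:
            self._started.pop(worker, None)

    def wedged(self, now=None):
        # Called by the main process: names of workers whose task exceeded the timeout.
        if now is None:
            now = time.monotonic()
        with self._lock:
            return sorted(
                name
                for name, t0 in self._started.items()
                if now - t0 > self.timeout_s
            )
```

The main loop could call `wedged()` every so often and log the result. What "dealing with" a wedged worker should mean is a separate question: if the workers are Python threads they cannot be forcibly stopped, so the realistic options are probably to kill the stuck child process, release the node for other workers, or start a replacement worker.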