[x] Search for duplicates among the existing issues, both open and closed.
Proposal
I found a case where a dynamic target came very close to finishing but never completed, even though I still had workers up and waiting for work. I suspect that targets were allocated to workers that then disappeared when they hit the HPC time limit. I would have liked drake to recognise that a worker had disappeared and then send its target to another worker that was still around.
I believe this would require clustermq to be able to report which workers have disappeared (via SLURM, in my case).
Unfortunately, drake has no way of knowing which clustermq workers stopped unexpectedly or which target was running at the time. Maybe follow up on https://github.com/mschubert/clustermq/issues/101.