ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

send a target to a second worker in clustermq parallelism #1287

Closed kendonB closed 4 years ago

kendonB commented 4 years ago

Prework

Proposal

I found a case where a dynamic target came very close to finishing but never did, even though I still had workers up and waiting for work. I suspect that targets were allocated to workers that then disappeared when they hit the HPC time limit. I would have liked drake to recognise that a worker had disappeared and then send its target to another worker that was still around.

I believe this would require clustermq to be able to report which workers have disappeared (via SLURM, in my case).
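
To make the requested behaviour concrete, here is a minimal, library-agnostic sketch of the idea. It is not drake or clustermq code, and it assumes something neither package is stated to provide in this thread: a per-worker heartbeat that a scheduler can inspect. The `Scheduler` class, its methods, and the `timeout` parameter are all hypothetical names chosen for illustration.

```python
from collections import deque


class Scheduler:
    """Toy scheduler (hypothetical): tracks which worker holds which target."""

    def __init__(self, targets, timeout):
        self.pending = deque(targets)  # targets not yet assigned
        self.running = {}              # worker id -> target in progress
        self.last_seen = {}            # worker id -> time of last heartbeat
        self.done = []                 # finished targets, in completion order
        self.timeout = timeout         # silence longer than this means "lost"

    def heartbeat(self, worker, now):
        """Record that a worker was alive at time `now`."""
        self.last_seen[worker] = now

    def assign(self, worker, now):
        """Give an idle worker the next pending target, if any."""
        self.heartbeat(worker, now)
        if worker in self.running or not self.pending:
            return None
        target = self.pending.popleft()
        self.running[worker] = target
        return target

    def finish(self, worker, now):
        """Mark the worker's current target as done."""
        self.heartbeat(worker, now)
        self.done.append(self.running.pop(worker))

    def reap(self, now):
        """Requeue targets held by workers whose last heartbeat is too old."""
        lost = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for worker in lost:
            del self.last_seen[worker]
            target = self.running.pop(worker, None)
            if target is not None:
                # Put the orphaned target at the front of the queue.
                self.pending.appendleft(target)
        return lost


def demo():
    s = Scheduler(["a", "b"], timeout=10)
    s.assign("w1", now=0)       # w1 takes "a", then goes silent
    s.assign("w2", now=0)       # w2 takes "b"
    s.finish("w2", now=5)       # w2 finishes "b"
    s.heartbeat("w2", now=20)   # w2 is still alive at time 20
    lost = s.reap(now=20)       # w1 was last seen at 0, so it is declared lost
    retry = s.assign("w2", now=20)  # the orphaned target goes to w2
    s.finish("w2", now=25)
    return lost, retry, s.done
```

In this sketch, `demo()` shows the orphaned target being handed to the surviving worker. The hard part in practice is the piece assumed away here: reliably learning that a worker has gone, and which target it was holding when it did.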

wlandau commented 4 years ago

Unfortunately, drake has no way of knowing which clustermq workers stopped unexpectedly or which target was running at the time. Maybe follow up on https://github.com/mschubert/clustermq/issues/101.