mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Redeploy workers that time out too soon? #101

Closed: wlandau closed this issue 3 years ago

wlandau commented 6 years ago

In my experience, HPC systems in academic settings can have very restrictive wall time limits. It may be difficult in these environments to follow your recommendation to keep the same pool of reserved workers for an entire end-to-end project.

You may have already implemented workarounds; I do not know. But just in case, here are a couple of ideas.

  1. If a worker times out, launch a new one and have it attempt the work of its predecessor. In fact, it would be nice to do this for crashed workers in general, up to a given number of retries (a rough user-level sketch follows this list).
  2. If a worker has been running for a certain (user-defined) length of time, make it restart before accepting any new jobs. This would be amazing to have for drake.
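
For illustration, here is a rough user-level sketch of idea 1 (`q_with_retries` is a hypothetical helper, not part of clustermq). It assumes that failed calls come back as error objects when `fail_on_error = FALSE`; a worker killed by the scheduler's walltime limit may instead abort the whole `Q()` call, which is exactly what built-in redeployment would handle better.

```r
library(clustermq)

# Hypothetical helper: resubmit calls whose results came back as errors.
# Assumes Q(..., fail_on_error = FALSE) returns error objects in place of
# the failed results.
q_with_retries <- function(fun, xs, n_jobs, max_retries = 2, ...) {
  res <- vector("list", length(xs))
  todo <- seq_along(xs)
  for (attempt in seq_len(max_retries + 1)) {
    if (length(todo) == 0) break
    part <- Q(fun, x = xs[todo], n_jobs = n_jobs, fail_on_error = FALSE, ...)
    failed <- vapply(part, inherits, logical(1), what = "error")
    res[todo[!failed]] <- part[!failed]
    todo <- todo[failed]
  }
  if (length(todo) > 0)
    warning(length(todo), " calls still failed after all retries")
  res
}

# e.g. out <- q_with_retries(function(x) x^2, xs = 1:100, n_jobs = 10)
```
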
mschubert commented 6 years ago

My first thought on this: I have never come across a system where you couldn't request at least a couple of days' worth of walltime.

I'm inclined to leave this as the user's responsibility: either request an appropriate walltime, or process the workflow in chunks that fit within it (a rough sketch of the chunking approach follows).
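
For concreteness, a minimal sketch of that chunking workaround, assuming a simple embarrassingly parallel workflow (the inputs, chunk size, and job count here are made up): each `Q()` call submits its own short-lived set of workers, so every submission only needs to fit within the walltime limit on its own.

```r
library(clustermq)

xs <- 1:10000
# 1000 calls per submission; each Q() call starts and tears down its own workers
chunks <- split(xs, ceiling(seq_along(xs) / 1000))

out <- vector("list", length(chunks))
for (i in seq_along(chunks)) {
  out[[i]] <- Q(function(x) x^2, x = chunks[[i]], n_jobs = 20)
}
results <- unlist(out, recursive = FALSE)  # flatten back into one result list
```
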

However, if this affects many users, I'd be willing to reconsider.

wlandau commented 6 years ago

Related: https://github.com/ropensci/drake/issues/349