mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Add resubmission function #203

Open mschubert opened 4 years ago

mschubert commented 4 years ago

originally posted in #153 by @nick-youngblut

I'm a big fan of snakemake, which allows for automatic resubmission of failed jobs with increased resources (e.g., mem = lambda wildcards, threads, attempt: attempt * 8 # Gb of memory grows with each attempt). It would be really awesome to have that feature in clustermq. For example, one could provide a function instead of a value in the template:

tmpl = list(job_mem = function(attempt) attempt * 8)
fx = function(x, y) x * 2 + y
Q(fx, x=1:10, const=list(y=10), n_jobs=10, job_size=1, template=tmpl)

One would also need a max_attempts parameter for Q().
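A tiny sketch of how such function-valued template fields could be resolved before each submission round; fill_template() is hypothetical and not part of clustermq:

```r
# Hypothetical helper: evaluate any function-valued template entries
# for the current attempt number, leaving plain values untouched.
fill_template = function(tmpl, attempt) {
    lapply(tmpl, function(v) if (is.function(v)) v(attempt) else v)
}

fill_template(list(job_mem = function(attempt) attempt * 8, queue = "short"),
              attempt = 2)
# job_mem resolves to 16; queue stays "short"
```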

nick-youngblut commented 3 years ago

As a simpler approach, clustermq could keep a log of all jobs that completed successfully (or just those that failed), and the user could then set a parameter in Q() to re-run only the previously failed jobs (e.g., Q(just.failed=TRUE)). The user could then wrap Q() in a loop that increases the resources in template on each iteration.
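The loop described above might look roughly like this. A sketch only: retry_Q() is not part of the clustermq API, and it assumes that with the existing fail_on_error = FALSE argument, failed calls come back as condition objects in the result list.

```r
# Sketch: re-run only the inputs whose calls failed, doubling the
# requested memory on each attempt. Assumes clustermq is attached
# (providing Q()) and that failed calls are returned as condition
# objects when fail_on_error = FALSE.
retry_Q = function(fx, xs, max_attempts = 3, base_mem = 1024, ...) {
    res = vector("list", length(xs))
    todo = seq_along(xs)                  # indices still to (re)run
    for (attempt in seq_len(max_attempts)) {
        cur = Q(fx, x = xs[todo],
                memory = base_mem * 2^(attempt - 1),
                fail_on_error = FALSE, ...)
        failed = vapply(cur, inherits, logical(1), what = "condition")
        res[todo[!failed]] = cur[!failed] # keep successful results
        todo = todo[failed]               # retry only the failures
        if (length(todo) == 0L) break
    }
    res
}
```

fail_on_error and memory are existing Q() arguments; the bookkeeping around them is sketch-level and would need adapting to how a given scheduler template names its memory field.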

nick-youngblut commented 3 years ago

It appears that clustermq array jobs suffer from the issue that if one of the N parallel cluster jobs (e.g., n_jobs=20) dies, the other jobs continue, but clustermq does not top the pool back up to 20 (unlike snakemake).

In my case, this means that only 4 of 20 (n_jobs=20) jobs are currently running, since 16 of the cluster jobs have died for one reason or another. Running 4 jobs isn't really efficient, or what I intended, but these 4 remaining jobs have been running for >1 day, so I don't want to kill them and lose all of that computation.

What do others do when some of their hundreds or thousands of jobs fail? Do they always have to figure out which ones failed and then re-run just those? That's potentially a lot of extra code just to identify the failed jobs and re-run only those (and the alternative is re-running everything).