mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Implement `resubmitJobs()` #174

Open mllg opened 6 years ago

mllg commented 6 years ago

This will restart jobs with the same resources as before (to fix #164) and defaults to expired jobs (as requested via mail). It can also be used in bt[lm]apply() to "resume" a calculation.

ryananeff commented 5 years ago

Here is some of my own code for doing this, as a rough guide. It keeps resubmitting failed jobs until one of them reaches maxRetries attempts.

# Assumes a default registry is loaded, `res` is a named list of resources
# (memory, walltime, cores) and `maxRetries` is set by the caller.
message("Submitting unsubmitted jobs...")
batchtools::submitJobs(batchtools::findNotSubmitted()$job.id,
                       resources=res)

message("Waiting for jobs to complete...")
message(Sys.time())
# Retry counters, named by job id so lookups also work for non-contiguous ids
ids = batchtools::findJobs()$job.id
job_retries = setNames(rep(0, length(ids)), ids)
while(length(batchtools::findNotDone()$job.id)>0){
    batchtools::waitForJobs(timeout=60)
    err = batchtools::findErrors()$job.id
    expired = batchtools::findExpired()$job.id
    # Give up before resubmitting once a failed job has used all its retries
    failed = as.character(c(err, expired))
    if(any(job_retries[failed]>=maxRetries)){
        message("Maximum number of retries exceeded, stopping jobs...")
        batchtools::killJobs()
        # Export the registry to the global environment for debugging
        reg <<- batchtools::getDefaultRegistry()
        stop("Automatic retry failed, registry available for debugging at `reg`.")
    }
    # Errored jobs are resubmitted with unchanged resources
    for(i in err){
        key = as.character(i)
        job_retries[key] = job_retries[key] + 1
        message(paste0("Found error in job ",i,
           ", restarting, retry attempt ",job_retries[key]))
        print(batchtools::getErrorMessages(i))
        batchtools::submitJobs(i,resources=res)
    }
    # Expired jobs get 1.25x more resources with every retry
    for(i in expired){
        key = as.character(i)
        job_retries[key] = job_retries[key] + 1
        scale = 1.25^job_retries[[key]]
        message(paste0("Found expired job ",i,
           ", restarting with ",scale,
           "x more resources, retry attempt ",job_retries[key]))
        res_job = res
        res_job$memory = round(res$memory*scale)
        res_job$walltime = round(res$walltime*scale)
        res_job$cores = round(res$cores*scale)
        #print(batchtools::getLog(i))
        batchtools::submitJobs(i,resources=res_job)
    }
}

nick-youngblut commented 3 years ago

Are there any plans to implement this? One of snakemake's best features is that it can resubmit jobs with increased user-defined resources (e.g., mem = attempt ** 3 + 10, with attempt incrementing by 1 for each job attempt).
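
For illustration, the attempt-based formula quoted above could be written in R roughly as follows. This is only a sketch: `mem_for_attempt()` and `resources_for_attempt()` are hypothetical helpers that exist in neither batchtools nor snakemake, and the `memory` key is an assumption about what the cluster template expects.

```r
# Hypothetical helper (not part of batchtools or snakemake): memory request
# for a given attempt, following the formula mem = attempt ** 3 + 10.
mem_for_attempt = function(attempt) {
  stopifnot(is.numeric(attempt), all(attempt >= 1))
  attempt^3 + 10
}

# Return a copy of a base resources list with the memory entry set for the
# given attempt. The `memory` name is an assumption; use whichever key your
# template reads.
resources_for_attempt = function(base, attempt) {
  base$memory = mem_for_attempt(attempt)
  base
}

mem_for_attempt(1:3)
# → 11 18 37
```

A list built this way could then be passed as the `resources` argument of `submitJobs()` when resubmitting a failed job.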

Using batchtools or clustermq, which don't have such a resubmit feature (AFAIK), can result in a lot of hassle when X% of hundreds or thousands of jobs are unsuccessful. I have to figure out which jobs failed, resubmit just those jobs with more resources, see which of those jobs failed, resubmit the failed jobs with more resources, etc.

For those that need/want resubmission of jobs and stumble upon this issue: I highly recommend snakemake (which can run R code), but it is often overkill for simpler tasks.