mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Too many jobs cause "BatchtoolsExpiration: Future ('<none>') expired" #240


nick-youngblut commented 5 years ago

Whenever I run approx. > 200 jobs on our SGE cluster, I get the following error:

BatchtoolsExpiration: Future ('<none>') expired (registry path /ebio/abt3_projects/software/dev/DeepMAsED/notebooks/04_contig_viewing/.future/20190812_101254-P53iu9/batchtools_2061698801)

If I manually batch the jobs into smaller groups (e.g., 50 or 100 jobs), then I don't get this error.

Version: r-batchtools 0.9.11 r341hc070d10_0 conda-forge

~/.batchtools.conf.R file:

default.resources = list(h_rt = '00:59:00',
                         h_vmem = '4G',
                         threads = '1',
                         conda.env = 'py3')
cluster.functions = makeClusterFunctionsSGE(template = "~/.batchtools.tmpl")
temp.dir = "/ebio/abt3_projects/temp_data/"

.batchtools.tmpl file:

#!/bin/bash
#$ -N <%= job.name %>
#$ -j y
#$ -o <%= log.file %>
#$ -cwd
#$ -V
#$ -pe parallel <%= resources$threads %>
#$ -l h_rt=<%= resources$h_rt %>
#$ -l h_vmem=<%= resources$h_vmem %>

. ~/.bashrc
conda activate <%= resources$conda.env %>

## Export the value of the DEBUGME environment variable to the slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0

Example of resources used:

resources = list(h_rt = '00:59:00',
                 h_vmem = '4G',
                 threads = 1,
                 conda.env = 'py3_batchtools')     # conda env with batchtools installed
plan(batchtools_sge, resources = resources, workers = 50)

I'm using future_lapply() to run a function that calls system2().
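
For context, the pattern is roughly the following (the inputs and the system2() call here are placeholders, not my actual code):

library(future.apply)

## Placeholder for the real per-job work: each element triggers one system2() call
run_one <- function(x) {
  system2("echo", args = as.character(x), stdout = TRUE)
}

results <- future_lapply(seq_len(500), run_one)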

mllg commented 5 years ago

Do you have the same issue without future? Could you run btlapply() instead of future_lapply()?
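
Something along these lines (an untested sketch; the inputs and function are placeholders, the resources mirror your config):

library(batchtools)

## Temporary registry; picks up cluster.functions etc. from ~/.batchtools.conf.R
reg <- makeRegistry(file.dir = NA)

res <- btlapply(seq_len(500),
                function(x) system2("echo", args = as.character(x), stdout = TRUE),
                resources = list(h_rt = "00:59:00", h_vmem = "4G",
                                 threads = "1", conda.env = "py3"),
                reg = reg)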

HenrikBengtsson commented 1 year ago

This issue was posted on 2019-08-13. At that time, future.batchtools 0.8.0 was available. In that version, workers = +Inf was the default. Because of this, map-reduce functions such as future.apply::future_lapply(X, ...) and furrr::future_map(X, ...) would produce one future per element in X, which means there would also be one job per element in X.

Whenever I run approx. > 200 jobs on our SGE cluster, I get the following error: ...

This sounds like length(X) > 200.
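
To make that concrete, here is a sketch of the old behavior (X and the worker function are placeholders):

library(future.batchtools)
library(future.apply)

plan(batchtools_sge)  ## in future.batchtools 0.8.0, workers defaulted to +Inf

X <- seq_len(500)
y <- future_lapply(X, sqrt)  ## one future, and hence one SGE job, per element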

If I manually batch the jobs into smaller groups (e.g., 50 or 100 jobs), then I don't get this error. … plan(batchtools_sge, resources = resources, workers = 50)

Yes, one can limit the number of workers that Futureverse sees. To set a smaller number of workers, say, 100, use:

plan(batchtools_sge, workers = 100)

That causes future_lapply() to split X into 100 equally sized chunks and submit each chunk as one job to the scheduler.
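
For example, a sketch with made-up sizes:

library(future.batchtools)
library(future.apply)

plan(batchtools_sge, workers = 100)

X <- seq_len(500)

## future_lapply() splits X into 100 chunks of ~5 elements each,
## so the scheduler sees 100 jobs rather than 500
y <- future_lapply(X, sqrt)

## Chunking can also be set explicitly, e.g. ~50 elements per job:
y <- future_lapply(X, sqrt, future.chunk.size = 50)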

One can also change the default number of workers to, say, 50, via the R option future.batchtools.workers=50 or the environment variable R_FUTURE_BATCHTOOLS_WORKERS=50; cf. https://future.batchtools.futureverse.org/reference/future.batchtools.options.html.
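
For example:

## Two equivalent ways to change the default number of workers:
options(future.batchtools.workers = 50)         ## R option, e.g. in ~/.Rprofile
Sys.setenv(R_FUTURE_BATCHTOOLS_WORKERS = "50")  ## env variable; must be set before
                                                ## future.batchtools is loaded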

Note that, in future.batchtools 0.9.0 (2020-04-14), I changed the default to workers = 100.