mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

conda env activation in template? #181

Open nick-youngblut opened 6 years ago

nick-youngblut commented 6 years ago

Researchers in my lab use conda to manage all bioinformatics software, including Python and R packages. We have an SGE cluster available, and to submit jobs to the cluster that require a specific conda environment, we have to include the following line in the job submission script: source activate MY_CONDA_ENV. Can this line be added to the batchtools SGE template? Can the user specify the particular conda environment via makeClusterFunctionsSGE or submitJobs?

mllg commented 6 years ago

You can pass any value to the template as a resource, e.g.

submitJobs(resources = list(walltime = 300, conda.env = "my.conda.env"))

Note that users can provide default values for these settings in their respective configuration files, e.g. by adding the following line to ~/.batchtools.conf.R:

default.resources = list(conda.env = "my.conda.env")

In your custom template file you then just need to add the line

source activate <%= conda.env %>

which should activate the conda env on the nodes. I assume you already have a template for your site; alternatively, you can take https://github.com/mllg/batchtools/blob/master/inst/templates/sge-simple.tmpl as a starting point and just add the conda activation after the shebang (first line).
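
For reference, the top of such a template might then look like this (a minimal sketch modeled on sge-simple.tmpl; note that, as corrected further down in this thread, resource values are accessed via the resources list inside the template):

#!/bin/bash

## scheduler directives as in sge-simple.tmpl
#$ -N <%= job.name %>
#$ -o <%= log.file %>
#$ -j y

## activate the requested conda environment before starting R
source activate <%= resources$conda.env %>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0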

nick-youngblut commented 6 years ago

I'm getting the error:

Error in cat(conda.env) : object 'conda.env' not found
Error in cfBrewTemplate(reg, template, jc) :
  Error brewing template: Error in cat(conda.env) : object 'conda.env' not found

My config file:

$ cat ~/.batchtools.conf.R
default.resources = list(conda.env = "base")
cluster.functions = makeClusterFunctionsSGE(template = "~/.batchtools.tmpl")

My template file:

$ cat ~/.batchtools.tmpl
#!/bin/bash

## The name of the job, can be anything, simply used when displaying the list of running jobs
#$ -N <%= job.name %>

## Combining output/error messages into one file
#$ -j y

## Giving the name of the output log file
#$ -o <%= log.file %>

## One needs to tell the queue system to use the current directory as the working directory
## Or else the script may fail as it will execute in your top level home directory /home/username
#$ -cwd

## Use environment variables
#$ -V

## Use correct queue
# -q <%= resources$queue %>

export PATH=/ebio/abt3_projects/software/dev/miniconda3_dev/bin:$PATH
source activate <%= conda.env %>

## Export value of DEBUGME environment variable to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0

The steps that caused the error:

$ R 
> library('batchtools')
Loading required package: data.table

data.table 1.10.4.3
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Breaking change introduced in batchtools v0.9.6: The format of the returned data.table of the functions `reduceResultsDataTable()`, `getJobTable()`, `getJobPars()`, and `getJobResources()` has changed. List columns are not unnested automatically anymore. To manually unnest tables, batchtools provides the helper function `unwrap()` now, e.g. `unwrap(getJobPars())`. The previously introduced helper function `flatten()` will be deprecated due to a name clash with `purrr::flatten()`.
>
> reg = makeRegistry(file.dir = NA, seed = 1)
Sourcing configuration file '/ebio/abt3/nyoungblut/.batchtools.conf.R' ...
Created registry in '/tmp/RtmpzgFPtX/registryc3bd73c6e778' using cluster functions 'SGE'
> piApprox = function(n) {
+  nums = matrix(runif(2 * n), ncol = 2)
+  d = sqrt(nums[, 1]^2 + nums[, 2]^2)
+  4 * mean(d <= 1)
+ }
> batchMap(fun = piApprox, n = rep(1e5, 10))
Adding 10 jobs ...
> submitJobs(resources = list(walltime = 300, conda.env="r_install"))
Submitting 10 jobs in 10 chunks using cluster functions 'SGE' ...
Error in cat(conda.env) : object 'conda.env' not found
Error in cfBrewTemplate(reg, template, jc) :
  Error brewing template: Error in cat(conda.env) : object 'conda.env' not found
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] batchtools_0.9.8    data.table_1.10.4-3

loaded via a namespace (and not attached):
 [1] backports_1.1.2   magrittr_1.5      assertthat_0.2.0  R6_2.2.2
 [5] base64url_1.3     prettyunits_1.0.2 tools_3.3.2       withr_2.1.2
 [9] rappdirs_0.3.1    stringi_1.1.6     progress_1.1.2    checkmate_1.8.5
[13] digest_0.6.12     brew_1.0-6

mllg commented 6 years ago

My mistake. Try this line in your template:

source activate <%= resources$conda.env %>
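
Brew also lets you guard against a missing value, e.g. (a sketch; this emits the activation line only when the resource is actually set):

<% if (!is.null(resources$conda.env)) { %>
source activate <%= resources$conda.env %>
<% } %>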

nick-youngblut commented 6 years ago

Thanks for the clarification! There seems to be something wrong with #$ -o <%= log.file %>: when I use that line in the template, all jobs die as Eqw. If I change it to #$ -o /path/to/a/log/file, the jobs run correctly.

nick-youngblut commented 6 years ago

Also, all of my jobs are listed as expired:

> getStatus()
Status for 10 jobs:
  Submitted    : 10 (100.0%)
  -- Queued    :  0 (  0.0%)
  -- Started   :  0 (  0.0%)
  ---- Running :  0 (  0.0%)
  ---- Done    :  0 (  0.0%)
  ---- Error   :  0 (  0.0%)
  ---- Expired : 10 (100.0%)

nick-youngblut commented 6 years ago

The issue was that the default registry file path in /tmp/ wasn't accessible by the cluster nodes.
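
Pointing file.dir at a directory on shared storage should fix it; a minimal sketch (the path below is hypothetical, and makeRegistry() requires that the directory not exist yet):

library(batchtools)
## must be on a filesystem that all compute nodes can read and write
reg = makeRegistry(file.dir = "/ebio/abt3/nyoungblut/bt_registry", seed = 1)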

nick-youngblut commented 6 years ago

Now I have the problem of creating a temporary directory somewhere else. makeRegistry() doesn't allow the directory to already exist. Also, if I want to run batchtools workflows in parallel (e.g., in different Jupyter notebooks), I'll have to create a different directory for each instance. I could use the uuid package to create a unique ID for each registry directory, but that seems like overkill.

mllg commented 6 years ago

For this reason you can set temp.dir in your configuration file; mine looks like this:

cluster.functions = makeClusterFunctionsSlurm("slurm-dortmund", array.jobs = FALSE)
default.resources = list(walltime = 300L, memory = 512L, ncpus = 1L)
temp.dir = "~/tmp"

However, you have to clean up old registry directories manually from time to time.
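
If the registry object is still loaded, one way to clean up is batchtools' own helper (a sketch):

## deletes the registry's file.dir from the file system
removeRegistry(wait = 0, reg = reg)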

mllg commented 6 years ago

To make this clearer: after setting the temp dir, you can use makeRegistry(file.dir = NA) to get exactly what you were aiming at with your UUID workaround.
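
A quick sketch of the intended usage, assuming temp.dir = "~/tmp" as in the configuration above:

## each call creates its own uniquely named registry directory under
## temp.dir, so parallel workflows (e.g. separate Jupyter notebooks)
## don't collide
reg1 = makeRegistry(file.dir = NA, seed = 1)
reg2 = makeRegistry(file.dir = NA, seed = 1)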

Does this work for you?

nick-youngblut commented 6 years ago

Yep, that worked. Thanks for the clarification!

nick-youngblut commented 6 years ago

One thing I have noticed is that if I use a non-existent conda environment (e.g., a typo in the name), the jobs are listed as "Expired", and there are no errors or error messages listed when using findErrors() and getErrorMessages(). Is there a way to see error messages for "expired" jobs?

mllg commented 6 years ago

> If I use a non-existent conda environment (e.g., a typo in the name), then the jobs become listed as "Expired", and there's no errors or error messages listed when using findErrors() and getErrorMessages(). Is there a way to see error messages for "expired" jobs?

Unfortunately, no. Errors can only be caught while brewing the template on the master or after R has started on the slave.

You could try to catch invalid resources while brewing by adding some assertions to the beginning of your template file, e.g.

<%
# fail during brewing (on the master) if the resource is missing or invalid
walltime = resources$walltime
stopifnot(is.numeric(walltime), length(walltime) == 1, !is.na(walltime), walltime > 0)
%>

I don't know whether you can determine on the master if a conda env will be valid on the slave, though.
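
One partial workaround (an untested sketch) is to abort the job script with a visible message when activation fails; the job will still end up "Expired", but the scheduler log file should then say why:

source activate <%= resources$conda.env %> || {
    echo "could not activate conda env '<%= resources$conda.env %>'" >&2
    exit 1
}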