mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Array jobs not enabled in makeClusterFunctionsSLURM? #188

Closed jgrn307 closed 6 years ago

jgrn307 commented 6 years ago

I'm trying to use an array job with the tutorial example:

library("batchtools")
reg = makeRegistry(file.dir = NA, seed = 1)

# Now add cluster?
reg$cluster.functions = makeClusterFunctionsSlurm()

piApprox = function(n) {
    nums = matrix(runif(2 * n), ncol = 2)
    d = sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}
piApprox(1000)

ids = batchMap(fun = piApprox, n = rep(1e5, 10))

names(getJobTable())

submitJobs(resources = list(walltime = 60, memory = 1024, ncpus = 1, chunks.as.array.jobs = TRUE))

But when the jobs are submitted, I'm not seeing any array jobs; each job is being submitted separately. Any ideas how to tweak this? I'm using the vanilla batchtools.slurm.tmpl template. The jobs do run, they just do so as separate jobs.

jgrn307 commented 6 years ago

Here's the output of one of the .job files (note that no #SBATCH --array line is present):

#!/bin/bash
#SBATCH --job-name=job11554770bc9db6090aa867dc1556bbd8
#SBATCH --output=/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpZZI9p4/registry353047e0a45b/logs/job11554770bc9db6090aa867dc1556bbd8.log
#SBATCH --error=/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpZZI9p4/registry353047e0a45b/logs/job11554770bc9db6090aa867dc1556bbd8.log
#SBATCH --time=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1024

export DEBUGME=
Rscript -e 'batchtools::doJobCollection("/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpZZI9p4/registry353047e0a45b/jobs/job11554770bc9db6090aa867dc1556bbd8.rds")'

jgrn307 commented 6 years ago

Yet another follow-up: I see "array.jobs" is no longer a parameter of makeClusterFunctionsSlurm in the latest git release. What happened to it?

mllg commented 6 years ago

Array jobs are still somewhat experimental and underdocumented. For Slurm, you have to do the following:

  1. Set array.jobs to TRUE during the cluster function construction (as you already did). This just ensures that the command line parameter "-r" is passed to squeue. Sadly, this is required because some clusters are configured not to support array jobs at all, in which case passing "-r" throws an exception.
  2. Set the resource chunks.as.arrayjobs to TRUE (as you also already did).
  3. Chunk the jobs. You need to define which jobs are grouped together. This is done by passing a data.frame with columns job.id and chunk to submitJobs(). To create such a data frame, see http://mllg.github.io/batchtools/reference/chunk.html or create one yourself (a consolidated sketch follows this list):
    ids = findJobs()
    # two chunks of 5 jobs each
    ids$chunk = rep(1:2, each = 5)
    submitJobs(ids = ids, resources = list(...))
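
Putting the three steps together, a minimal sketch based on your tutorial example (untested here; it assumes the stock batchtools.slurm.tmpl resource names and uses chunk() for the grouping, rep() as in step 3 works just as well):

library("batchtools")
reg = makeRegistry(file.dir = NA, seed = 1)
# step 1: allow "-r" to be passed to squeue
reg$cluster.functions = makeClusterFunctionsSlurm(array.jobs = TRUE)

piApprox = function(n) {
    nums = matrix(runif(2 * n), ncol = 2)
    4 * mean(sqrt(nums[, 1]^2 + nums[, 2]^2) <= 1)
}
ids = batchMap(fun = piApprox, n = rep(1e5, 10))

# step 3: group the 10 jobs into two chunks of 5 jobs each
ids$chunk = chunk(ids$job.id, chunk.size = 5)

# step 2: submit each chunk as a single Slurm array job
submitJobs(ids = ids, resources = list(walltime = 60, memory = 1024, ncpus = 1,
    chunks.as.arrayjobs = TRUE))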

Please report back if you encounter further issues.

NB: The argument array.jobs is still there: https://github.com/mllg/batchtools/blob/master/R/clusterFunctionsSlurm.R#L29

wlandau commented 6 years ago

What about pipelines, as described here? Is that something batchtools will support at some point?

jgrn307 commented 6 years ago

Hmm, ok, still not quite working. Now I submit the following:

library("batchtools")
reg = makeRegistry(file.dir = NA, seed = 1)

# Now add cluster?
reg$cluster.functions = makeClusterFunctionsSlurm(template="~/test_batchtools/batchtools.slurm.tmpl")

piApprox = function(n) {
    nums = matrix(runif(2 * n), ncol = 2)
    d = sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}
piApprox(1000)

ids = batchMap(fun = piApprox, n = rep(1e5, 10))

ids = findJobs()
# two chunks of 5 jobs each
ids$chunk = rep(1:2, each = 5) 
submitJobs(ids = ids, resources = list(ncpus = 1, walltime = 1, chunks.as.array.jobs = TRUE, memory = 8192))

It does submit 2 jobs: Submitting 10 jobs in 2 chunks using cluster functions 'Slurm' ...

But the jobs are not array jobs:

#!/bin/bash
#SBATCH --job-name=job9d5c906f2d8a6c80bc6914454f08ae8c
#SBATCH --output=/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpdeC3RD/registry256517957d372/logs/job9d5c906f2d8a6c80bc6914454f08ae8c.log
#SBATCH --error=/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpdeC3RD/registry256517957d372/logs/job9d5c906f2d8a6c80bc6914454f08ae8c.log
#SBATCH --time=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8192

export DEBUGME=
...

Note the lack of an #SBATCH --array line.

mllg commented 6 years ago

Does your template include a line like this one?

https://github.com/mllg/batchtools/blob/master/inst/templates/slurm-lido3.tmpl#L42
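
That line conditionally emits the array directive; it looks roughly like this:

<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>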

jgrn307 commented 6 years ago

Yep, I pulled the simple template and used it as-is (except I modified the Rscript call to run through Singularity).

#!/bin/bash

## Job Resource Interface Definition
##
## ntasks [integer(1)]:       Number of required tasks,
##                            Set larger than 1 if you want to further parallelize
##                            with MPI within your job.
## ncpus [integer(1)]:        Number of required cpus per task,
##                            Set larger than 1 if you want to further parallelize
##                            with multicore/parallel within each task.
## walltime [integer(1)]:     Walltime for this job, in minutes.
##                            Must be at least 1 minute.
## memory   [integer(1)]:     Memory in megabytes for each cpu.
##                            Must be at least 100 (when I tried lower values my
##                            jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

## Initialize work environment like
## source /etc/profile
## module add ...

## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
singularity exec /data/gpfs/home/jgreenberg/gearslaboratory-gears-singularity-master-gears-general.simg Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

mllg commented 6 years ago

> What about pipelines, as described here? Is that something batchtools will support at some point?

Would be nice to have. I've opened an issue for this (#190).

jgrn307 commented 6 years ago

Where does the "brew"ing actually take place (which function)? It looks like maybe array.jobs isn't being properly passed to the template?

jgrn307 commented 6 years ago

Yet another follow-up: I did some error checking and simply put:

<%= array.jobs %>

into the template just to see what happens, and it returns "FALSE", so array.jobs is not being passed as TRUE when building the template.

mllg commented 6 years ago

There was a typo in the documentation (cd5f34da8a311bfb84de70acda3607fbc1739d46). Please try again with chunks.as.arrayjobs.

Brewing: the template file is read into a character vector in https://github.com/mllg/batchtools/blob/master/R/clusterFunctions.R#L150, which is then passed to https://github.com/mllg/batchtools/blob/master/R/clusterFunctions.R#L181.
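
Conceptually (a simplified sketch, not the exact internals), the variables referenced in the template, e.g. job.name, log.file, resources, array.jobs, jobs and uri, are collected into an environment and the template text is rendered with brew::brew():

library(brew)

env = new.env()
env$job.name   = "job123"
env$log.file   = "/tmp/job123.log"
env$resources  = list(walltime = 60, memory = 1024, ncpus = 1)
env$array.jobs = TRUE
env$jobs       = data.frame(job.id = 1:5)  # nrow(jobs) feeds the --array range
env$uri        = "/tmp/job123.rds"

# render the template into a job file which is then passed to sbatch
brew(file = "batchtools.slurm.tmpl", output = "job123.job", envir = env)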

jgrn307 commented 6 years ago

> submitJobs(ids = ids, resources=list(ncpus=1,walltime=1,chunks.as.arrayjobs=T,memory = 8192))
Submitting 10 jobs in 2 chunks using cluster functions 'Slurm' ...
List of 4
 $ status  : int 0
 $ batch.id: chr [1:5] "124042_1" "124042_2" "124042_3" "124042_4" ...
 $ log.file: chr [1:5] "joba16a84dd57344bd1431e4ec04faab291.log_1" "joba16a84dd57344bd1431e4ec04faab291.log_2" "joba16a84dd57344bd1431e4ec04faab291.log_3" "joba16a84dd57344bd1431e4ec04faab291.log_4" ...
 $ msg     : chr "OK"
 - attr(*, "class")= chr "SubmitJobResult"
NULL
Error in submitJobs(ids = ids, resources = list(ncpus = 1, walltime = 1,  :
  Cluster function did not return a valid batch.id

Weird: this time it created array job #1 and submitted it, but then failed to submit job #2 (the second array job).

mllg commented 6 years ago

You found another bug 😢. I don't have access to a cluster that supports array jobs, so this is quite hard for me to test. I'll fire up my Docker containers and get this fixed tomorrow...

jgrn307 commented 6 years ago

Can you test by simply disabling the "submission" part (which is working fine) and just generating the job files? Thanks for your assistance! We have a machine that limits the number of concurrent jobs but allows each array to have 1000 entries (and we have to run about 10,000 unique tasks/jobs).
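
For context, once array jobs work the plan would be roughly this (assuming chunk() with its chunk.size argument handles the grouping):

ids = findJobs()
# group the ~10,000 jobs into chunks of at most 1000 jobs each,
# so each chunk is submitted as a single array job
ids$chunk = chunk(ids$job.id, chunk.size = 1000)
submitJobs(ids = ids, resources = list(walltime = 60, memory = 1024, ncpus = 1,
    chunks.as.arrayjobs = TRUE))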

mllg commented 6 years ago

I've committed a potential bugfix. My Docker containers are somewhat outdated, so I cannot test array jobs in them right now. However, I managed to get access to another cluster site that seems to support array jobs on a Slurm scheduler. As soon as I get the credentials, I can properly debug and write some tests for array jobs on this system.

jgrn307 commented 6 years ago

I think that did it! I'll close this for now!

jgrn307 commented 6 years ago

The fix appears to work on Slurm systems.