mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

submitJobs() doesn't do anything on SLURM cluster #251

Closed tpilz closed 4 years ago

tpilz commented 4 years ago

Hi,

In the past I successfully worked with batchtools on a SLURM HPC. However, after a few months I updated batchtools and some other packages, and suddenly submitJobs() runs but doesn't do anything.

Consider the piApprox example:

library(batchtools)

reg = makeRegistry(file.dir = "~/.batchtools/test", seed = 1)

piApprox = function(n) {
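  # Monte Carlo estimate of pi: draw n points uniformly in the unit square,
  # count the fraction inside the quarter circle, and multiply by 4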
  nums = matrix(runif(2 * n), ncol = 2)
  d = sqrt(nums[, 1]^2 + nums[, 2]^2)
  4 * mean(d <= 1)
}

batchMap(fun = piApprox, n = rep(1e5, 10))

submitJobs()

Nothing happens, i.e. no jobs are submitted on the cluster, and in R submitJobs() doesn't return and doesn't show any message whatsoever. However, from the list of processes I can see that a new R process has been started on the login node and consumes some CPU load, but even after a few hours it just doesn't come up with anything. I had a look at the registry directory reg$file.dir (which exists and is writable), but there all the directories are empty. Shouldn't there be a file in logs/ or jobs/?
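
For reference, a minimal way to inspect the registry contents (a sketch, using the reg object from above):

# list everything under the registry directory and print batchtools' own status summary
list.files(reg$file.dir, recursive = TRUE)
getStatus(reg = reg)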

Instead of submitJobs(), I also tried the following (I think it's more or less what submitJobs() does?):

jc = makeJobCollection(reg = reg)
tmpl = cfReadBrewTemplate("~/.config/batchtools/slurm.tmpl")
jobscript = cfBrewTemplate(reg = reg, text = tmpl, jc = jc)
runOSCommand("sbatch", shQuote(jobscript))

runOSCommand() hangs in the same way. The jobscript in the registry directory has been created and contains Rscript -e 'batchtools::doJobCollection("~/.batchtools/test/jobs/jobabe64ea85d6def870f1e0bd2ac3c94bf.rds")', but the rds file doesn't exist. Maybe this is a hint at what's going wrong? Or could it be related to the system rather than to batchtools?

My sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 12 SP3

Matrix products: default
BLAS/LAPACK: /p/system/packages/intel/parallel_studio_xe_2018_update1/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] batchtools_0.9.11 data.table_1.12.6

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        prettyunits_1.0.2 withr_2.1.2       assertthat_0.2.1 
 [5] digest_0.6.18     zeallot_0.1.0     crayon_1.3.4      rappdirs_0.3.1   
 [9] R6_2.4.0          backports_1.1.4   magrittr_1.5      pillar_1.4.2     
[13] debugme_1.1.0     rlang_0.4.2       progress_1.2.0    stringi_1.4.3    
[17] fs_1.3.1          brew_1.0-6        checkmate_1.9.1   vctrs_0.2.0      
[21] tools_3.5.1       hms_0.5.2         compiler_3.5.1    pkgconfig_2.0.2  
[25] base64url_1.4     tibble_2.1.3
mllg commented 4 years ago

This is pretty hard to debug remotely.

jc = makeJobCollection(reg = reg)
tmpl = cfReadBrewTemplate("~/.config/batchtools/slurm.tmpl")
jobscript = cfBrewTemplate(reg = reg, text = tmpl, jc = jc)
runOSCommand("sbatch", shQuote(jobscript))

jc must be saved to jc$uri using saveRDS() first (otherwise you get the error that the rds file does not exist).
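
In other words, a minimal sketch of your manual sequence with that step added:

jc = makeJobCollection(reg = reg)
saveRDS(jc, jc$uri)  # write the job collection to disk so doJobCollection() can read it
tmpl = cfReadBrewTemplate("~/.config/batchtools/slurm.tmpl")
jobscript = cfBrewTemplate(reg = reg, text = tmpl, jc = jc)
runOSCommand("sbatch", shQuote(jobscript))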

What does your batchtools config contain? Have you selected a backend/cluster functions implementation?

tpilz commented 4 years ago

Even when generating the rds file with saveRDS(jc, jc$uri), the behaviour stays the same: runOSCommand("sbatch", shQuote(jobscript)) doesn't do anything and I have to kill the R process.

I use the SLURM cluster functions.

My template job file:

#!/bin/bash

#SBATCH --account=myGroup
#SBATCH --mail-user=myName
#SBATCH --mail-type=END,FAIL
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --qos=<%= resources$qos %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<%= resources$ncpus %>

module load intel/2018.3
module load R/3.6.2

echo "job is being submitted"
## Run R:
## we merge R output with stdout from SLURM, which is then logged via the --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

And the config.R:

message("File config.R is executed ...")
cluster.functions = makeClusterFunctionsSlurm("slurm")

default.resources = list(qos = "short", walltime = 86400, ncpus = 1)
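
These defaults fill the resources$qos, resources$walltime and resources$ncpus placeholders in the template above; they can also be overridden per call (a sketch with illustrative values):

submitJobs(resources = list(qos = "short", walltime = 3600, ncpus = 2))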
mllg commented 4 years ago

Does the following work?

saveRDS(jc, jc$uri)
doJobCollection(jc$uri)

Can you start a job and manually run

Rscript -e 'batchtools::doJobCollection("[uri]")'

on the node?

This could also be a file system issue. Are you sure that the registry directory (reg$file.dir) is shared across all nodes? Can you try a different directory?
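
For example (a sketch; the path below is a placeholder for a directory mounted on both the login and compute nodes):

reg = makeRegistry(file.dir = "/shared/scratch/batchtools-test", seed = 1)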

tpilz commented 4 years ago

Believe it or not, as suddenly as it stopped working, it now works again (all of the approaches mentioned above, including submitJobs()). I haven't changed anything. No idea what went wrong; it must have been related to the HPC that I am using.

Thanks for your support anyway.

mllg commented 4 years ago

Glad the system is running again. Re-open if the problem emerges again.