tpilz closed this issue 4 years ago
This is pretty hard to debug remotely.
```r
jc = makeJobCollection(reg = reg)
tmpl = cfReadBrewTemplate("~/.config/batchtools/slurm.tmpl")
jobscript = cfBrewTemplate(reg, tmpl, jc)
runOSCommand("sbatch", shQuote(jobscript))
```
`jc` must be saved to `jc$uri` using `saveRDS()` (otherwise you get the error that the rds file does not exist).
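Combining that hint with the manual sequence quoted above gives roughly the following (a sketch, not a definitive recipe; it assumes an existing registry object `reg` and the template path from the original post):

```r
library(batchtools)

jc = makeJobCollection(reg = reg)

# Write the job collection to the file the generated jobscript will read;
# without this step, doJobCollection() on the node cannot find the rds file.
saveRDS(jc, jc$uri)

tmpl = cfReadBrewTemplate("~/.config/batchtools/slurm.tmpl")
jobscript = cfBrewTemplate(reg, tmpl, jc)
runOSCommand("sbatch", shQuote(jobscript))
```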
What does your batchtools config contain? Have you selected a backend/clusterfunction implementation?
Even when generating the rds file with `saveRDS(jc, jc$uri)`, the behaviour stays the same: `runOSCommand("sbatch", shQuote(jobscript))` doesn't do anything and I have to kill the R process. I use the Slurm cluster functions.
My template job file:
```bash
#!/bin/bash
#SBATCH --account=myGroup
#SBATCH --mail-user=myName
#SBATCH --mail-type=END,FAIL
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --qos=<%= resources$qos %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<%= resources$ncpus %>

module load intel/2018.3
module load R/3.6.2

echo "job is being submitted"

## Run R:
## we merge R output with stdout from SLURM, which then gets logged via the --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
```
And the config.R:
```r
message("File config.R is executed ...")
cluster.functions = makeClusterFunctionsSlurm("slurm")
default.resources = list(qos = "short", walltime = 86400, ncpus = 1)
```
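For reference, the resource names the template reads (`qos`, `walltime`, `ncpus`) can also be overridden per submission instead of relying on `default.resources`; a minimal sketch, assuming the same names as in the config above:

```r
# Values passed here take precedence over default.resources from config.R;
# the names must match what the template accesses via resources$...
submitJobs(reg = reg, resources = list(qos = "short", walltime = 3600, ncpus = 4))
```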
Does the following work?
```r
saveRDS(jc, jc$uri)
doJobCollection(jc$uri)
```
Can you start a job and manually run
`Rscript -e 'batchtools::doJobCollection("[uri]")'`
on the node?
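One way to try this interactively (a sketch, assuming your cluster allows interactive sessions via `srun`; `[uri]` stands in for the rds path from the generated jobscript):

```sh
# Request an interactive shell on a compute node
srun --pty bash

# Reproduce the template's environment, then run the job collection by hand
module load R/3.6.2
Rscript -e 'batchtools::doJobCollection("[uri]")'
```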
This could also be a file system issue. Are you sure that the registry directory (`file.dir`) is shared across all nodes? Can you try a different directory?
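A quick check is to write a marker file into the registry directory from the login node and look for it from a compute node (a sketch; `~/my-registry` is a placeholder for your actual `file.dir`):

```sh
# On the login node:
touch ~/my-registry/.fs_check

# From a compute node; if the file is missing there, the directory
# is not shared and batchtools cannot work with it:
srun ls -la ~/my-registry/.fs_check
```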
Believe it or not, but as suddenly as it stopped working, it now works again (all the approaches mentioned above, including `submitJobs()`). I haven't changed anything. No idea what was going wrong; it must have been related to the HPC system I am using.
Thanks for your support anyway.
Glad the system is running again. Re-open if the problem emerges again.
Hi,
in the past I have successfully worked with batchtools on a Slurm HPC. However, after a few months I updated batchtools and some other packages, and suddenly `submitJobs()` runs but doesn't do anything. Consider the `piApprox` example: nothing happens. No jobs are submitted on the cluster, and in R `submitJobs()` neither returns nor shows any message whatsoever. However, from the list of processes I can see that a new R process has been started on the login node and consumes some CPU load, but even after a few hours it doesn't come up with anything. I had a look at the registry directory `reg$file.dir` (which exists and is writable), but all the directories in it are empty. Shouldn't there be files in `logs/` or `jobs/`?

Instead of `submitJobs()`, I also tried the following (I think it's more or less what `submitJobs()` does?):

The function `runOSCommand()` runs endlessly in the same way. The `jobscript` in the registry directory has been created and contains `Rscript -e 'batchtools::doJobCollection("~/.batchtools/test/jobs/jobabe64ea85d6def870f1e0bd2ac3c94bf.rds")'`, but the rds file doesn't exist. Maybe this is a hint at what's going wrong? Or could it be related to the system and not batchtools?

My `sessionInfo()`: