mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

SLURM, squeue command not found. #230

Closed bw4sz closed 5 years ago

bw4sz commented 5 years ago

First day with batchtools, and I'm getting a nondescript SLURM error.

Running boilerplate code from ?submitJobs:

library(batchtools)

# batchtools temporary registry
reg = makeRegistry(file.dir = NA, seed = 1)
print(reg)
print("registry created")

print(reg)
# Toy function which creates a large matrix and returns the column means
fun = function(n, p) colMeans(matrix(runif(n*p), n, p))

# Arguments to fun:
args = CJ(n = c(1e4, 1e5), p = c(10, 50)) # like expand.grid()
print(args)

# batchtools submission
reg$cluster.functions = makeClusterFunctionsSlurm(template = "detection_template.tmpl", array.jobs = TRUE, nodename = "localhost", scheduler.latency = 5, fs.latency = 65)
ids = batchMap(fun, args = args, reg = reg)

# Set resources: enable memory measurement
res = list(measure.memory = TRUE)

# Submit jobs using the currently configured cluster functions
submitJobs(ids, resources = res, reg = reg)
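
(For completeness: the SLURM template further down also reads resources$walltime, resources$ncpus, and resources$memory, so a fuller resources list might look like the sketch below; the values are placeholders for illustration only.)

# Sketch of a resources list matching the template's resource interface
# (walltime in seconds, memory in MB per CPU, ncpus per task) -- placeholder values
res_full = list(walltime = 3600, memory = 1000, ncpus = 1, measure.memory = TRUE)
# submitJobs(ids, resources = res_full, reg = reg)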

Running locally on OSX, everything works with a SOCK cluster.

On the SLURM HPC system:

> # Submit jobs using the currently configured cluster functions
> submitJobs(ids, resources = res, reg = reg)
Error: Listing of jobs failed (exit code 127);
cmd: 'squeue --user=$USER --states=R,S,CG --noheader --format=%i -r'
output:
command not found

I saw this error here, but it's buried in a large thread and it isn't obvious what the solution was. Which command is not found? squeue? Run on the host it looks fine, and it works for other jobs I'm running:

[b.weinstein@dev2 analysis]$ squeue --user=$USER --states=R,S,CG --noheader --format=%i -r
36571678
36517633
36518232

Here is the SLURM template. I've changed my account name and other standard settings, such as the correct module load name for the cluster.

#!/bin/bash

## Modified from  https://github.com/mllg/batchtools/blob/master/inst/templates/

## Job Resource Interface Definition
##
## ntasks [integer(1)]:       Number of required tasks,
##                            Set larger than 1 if you want to further parallelize
##                            with MPI within your job.
## ncpus [integer(1)]:        Number of required cpus per task,
##                            Set larger than 1 if you want to further parallelize
##                            with multicore/parallel within each task.
## walltime [integer(1)]:     Walltime for this job, in seconds.
##                            Must be at least 60 seconds for Slurm to work properly.
## memory   [integer(1)]:     Memory in megabytes for each cpu.
##                            Must be at least 100 (when I tried lower values my
##                            jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

# Job name and who to send updates to
#SBATCH --mail-user=benweinstein2010@gmail.com
#SBATCH --mail-type=FAIL,END
#SBATCH --account=ewhite
#SBATCH --partition=hpg2-compute
#SBATCH --qos=ewhite-b   # Remove the `-b` if the script will take more than 4 days; see "bursting" below

#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (!is.null(resources$partition)) sprintf("#SBATCH --partition='%s'", resources$partition) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

## Initialize work environment like
## source /etc/profile
## module add ...

## Export value of DEBUGME environment variable to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

## Run R:
## we merge R output with stdout from SLURM, which then gets logged via the --output option
echo "submitting job"
module load gcc/6.3.0 R/3.4.3 gdal/2.2.1
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
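
For reference, the default.resources mentioned in the template comments above can be set in ~/.batchtools.conf.R; a minimal sketch follows (the template path and resource values are illustrative placeholders, not a tested configuration):

# ~/.batchtools.conf.R -- minimal sketch; template path and resource values are placeholders
cluster.functions = makeClusterFunctionsSlurm(
  template = "detection_template.tmpl",   # the Slurm template shown above
  array.jobs = TRUE
)
default.resources = list(
  walltime = 3600,   # seconds
  memory   = 1000,   # MB per CPU
  ncpus    = 1
)
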
bw4sz commented 5 years ago

This was not a batchtools error, but a configuration error. For those coming here, remember that the $PATH in your shell and the path checked from

Sys.which('squeue')

are not always the same, especially when relative paths are involved.
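
In other words, the environment R runs in may not have the SLURM binaries on its PATH even though your login shell does. A quick way to check from R, plus one possible workaround (the bin directory below is only an example; find yours with 'which squeue' in the shell):

# Where does R think squeue lives? An empty string means it is not on R's PATH.
Sys.which("squeue")

# Inspect the PATH that R actually sees
strsplit(Sys.getenv("PATH"), ":")[[1]]

# One possible workaround: prepend the directory containing squeue
# ("/opt/slurm/bin" is an example path only)
Sys.setenv(PATH = paste("/opt/slurm/bin", Sys.getenv("PATH"), sep = ":"))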

JamesThompsonC commented 8 months ago

Thank you, bw4sz, for reporting the error.

You mention a config error. How exactly do I solve that? Is it a setting in R, in bash, or in Slurm?

I'm not sure how to proceed. ~Jim