
SLURM-examples

The purpose of this repository is to collect examples of how to run SLURM jobs on CSG clusters. Students, visitors and staff members are welcome to use scripts from this repository in their work, and also contribute their own scripts.

If you want to share your SLURM script, then it is your responsibility to ensure that the script works and correctly allocates cluster resources.

Before allocating hundreds of jobs to the SLURM queue, it is a good idea to test your submission script on a small subset of your input files. Make sure that the SLURM arguments for the number of CPUs, amount of memory, etc. are adequate and will not harm other users.
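For example, a quick sanity check might look like this (the script and the small test input are hypothetical; adjust the resources to your own job):

sbatch --partition=nomosix --mem=4G --time=0-1:0 --output=test.slurm.log --wrap="Rscript /net/wonderland/home/foo/myscript.R small_test_input.txt"

Check the log and the actual resource usage (see sacct below) before submitting the full set of jobs.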

Table of contents

  • Submitting jobs
    • A simple job from the command line
    • A simple job from a bash script
    • A job that uses multiple cores (on a single machine)
    • A job that uses multiple nodes (i.e., MPI)
  • Many jobs
    • Simple job arrays
    • File of commands
    • Snakemake
  • Job dependencies
  • Monitoring jobs
  • Reviewing completed jobs
  • Canceling jobs
  • FAQ

Submitting jobs

A simple job from the command line

Just run a command like:

sbatch --partition=nomosix --job-name=myjob --mem=4G --time=5-0:0 --output=myjob.slurm.log --wrap="Rscript /net/wonderland/home/foo/myscript.R"

SLURM has short versions for some of its options, e.g.:

sbatch -p nomosix -J myjob --mem=4G -t 5-0:0 -o myjob.slurm.log --wrap="Rscript /net/wonderland/home/foo/myscript.R"

A simple job from a bash script

sbatch --partition=nomosix --job-name=myjob --mem=4G --time=5-0:0 --output=myjob.slurm.log --wrap="Rscript /net/wonderland/home/foo/myscript.R"

is equivalent to

sbatch myscript.sh

if myscript.sh contains:

#!/bin/bash
#SBATCH --partition=nomosix
#SBATCH --job-name=myjob
#SBATCH --mem=4G
#SBATCH --time=5-0:0
#SBATCH --output=myjob.slurm.log
Rscript /net/wonderland/home/foo/myscript.R

:frog: The bash script size must be less than 4MB. If you have a large script, then: (a) try using the short versions of SLURM options, shorten your bash variable names, and avoid long file paths and file names; or (b) split it into multiple scripts.

A job that uses multiple cores (on a single machine)

Some common single-node multi-threaded jobs:

:frog: If a job uses multiple threads, but you don't tell SLURM, SLURM will allocate too many jobs to that node. That will cause problems for all jobs on that node. Don't do that.

The main SLURM options for multi-threaded programs are --cpus-per-task (number of cores on a single node) and --mem (total memory for the job).

For example, sbatch --cpus-per-task=8 --mem=32G myjob.sh will allocate 8 CPUs (cores) on a single node with 32GB of RAM (4GB per thread) for a program named myjob.sh.

Making your job use the correct number of threads:
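For example, a minimal sketch of a multi-threaded submission script (the program name and its --threads option are placeholders for your own tool); the key point is that the thread count the program uses matches the --cpus-per-task you requested, which SLURM exports to the job as $SLURM_CPUS_PER_TASK:

#!/bin/bash
#SBATCH --partition=nomosix
#SBATCH --job-name=mythreadedjob
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=1-0:0
#SBATCH --output=mythreadedjob.slurm.log

# Pass the number of CPUs SLURM actually allocated to the program.
my_threaded_program --threads ${SLURM_CPUS_PER_TASK} input.txt

# For OpenMP programs, exporting OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} achieves the same thing.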

A job that uses multiple nodes (i.e., MPI)

MPI can use multiple cores across different nodes, so it can be scheduled differently than single-node multi-threaded jobs.

There are three main SLURM options for multi-process/multi-threaded programs: --ntasks, --cpus-per-task and --mem-per-cpu.

For example, sbatch --ntasks=8 --cpus-per-task=1 --mem-per-cpu=4G myjob.sh will allocate 8 CPUs (cores), which can be on different nodes. Each will have 4GB of RAM, for a total of 32GB.
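A minimal sketch of such a submission script (the MPI binary is hypothetical; srun starts one copy of the program per task, and the tasks may land on different nodes):

#!/bin/bash
#SBATCH --job-name=mympijob
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=1-0:0
#SBATCH --output=mympijob.slurm.log

# srun launches 8 tasks, possibly spread across several nodes.
srun ./my_mpi_program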

Many jobs

Simple job arrays

The recommended way to submit many jobs (>100) is to use SLURM's job array feature. Job arrays let you manage a large number of jobs more effectively and submit them faster.

To specify a job array, use --array as follows:

  1. Tell SLURM how many jobs you will have in the array:

    • --array=0-9. There are 10 jobs in the array. The first job has index 0, and the last job has index 9.
    • --array=5-8. There are 4 jobs in the array. The first job has index 5, and the last job has index 8.
    • --array=2,4,6. There are 3 jobs in the array, with indices 2, 4 and 6.
  2. Inside your script or command, use the $SLURM_ARRAY_TASK_ID bash variable, which stores the index of the current job in the array. For example, $SLURM_ARRAY_TASK_ID can be used to select the input file for each job in the array (job 0 will process input_file_0.txt, job 1 will process input_file_1.txt, and so on):

sbatch --array=0-9 --wrap='Rscript myscript.R input_file_$SLURM_ARRAY_TASK_ID.txt'

Note the single quotes around the wrapped command: they stop your login shell from expanding $SLURM_ARRAY_TASK_ID at submission time (where it is not set), so the variable is expanded inside each running job instead.

To keep track of SLURM output files, you can use %a when specifying the --output argument. SLURM replaces %a with the job's index in the array:

sbatch --array=0-9 --output=myoutput_%a.txt --wrap='Rscript myscript.R input_file_$SLURM_ARRAY_TASK_ID.txt'

SLURM allocates the same amount of memory (specified with the --mem option) for every job within a job array. For example, the following command allocates 8000MB (--mem takes megabytes by default, so roughly 8GB) for each R process inside the job array:

sbatch --array=0-9 --mem=8000 --wrap='Rscript myscript.R input_file_$SLURM_ARRAY_TASK_ID.txt'
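The same job array can also be written as a bash submission script. A sketch roughly equivalent to the command above (the job name, partition and time limit are illustrative additions):

#!/bin/bash
#SBATCH --partition=nomosix
#SBATCH --job-name=myarrayjob
#SBATCH --array=0-9
#SBATCH --mem=8000
#SBATCH --time=1-0:0
#SBATCH --output=myoutput_%a.txt

# Each array task picks its own input file using its task index.
Rscript myscript.R input_file_${SLURM_ARRAY_TASK_ID}.txt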

File of commands

Sometimes, it may be easier to write out a file with a list of command lines to execute, one per line. For example, your file could look like:

jobs.txt:

locuszoom --refsnp rs1
locuszoom --refsnp rs2
locuszoom --refsnp rs3

Now you can write a job submission script that looks like:

submit.sh:

#!/bin/bash

#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500
#SBATCH --partition=nomosix
#SBATCH --array=1-3

srun $(head -n $SLURM_ARRAY_TASK_ID jobs.txt | tail -n 1)

Note that in the above example --array=1-3: the last number corresponds to the total number of jobs (command lines) in your jobs.txt file. The head -n ... | tail -n 1 part is a simple trick to read the line whose number is $SLURM_ARRAY_TASK_ID (for example, if the task ID is 3, head reads the first 3 lines from jobs.txt, and tail then takes the last of those 3 lines, which is of course the 3rd line of the file).

Now you can submit this to the cluster by doing:

sbatch submit.sh
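If the number of lines in jobs.txt changes often, you can compute the array size at submission time instead of editing the script; options given on the sbatch command line override the #SBATCH directives inside the script (a small convenience, not part of the original example):

sbatch --array=1-$(wc -l < jobs.txt) submit.sh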

Snakemake

You can also use a workflow/pipeline system such as Snakemake to submit your jobs to the cluster. This can have many advantages over manual job submission, such as automatic tracking of which jobs have completed and which still need to run.

Examples of using Snakemake to submit cluster jobs can be found in a number of places; a minimal sketch is shown below.
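For instance, older Snakemake releases can hand each rule off to sbatch via the --cluster option (the resource values here are illustrative, and newer Snakemake versions use cluster profiles or executor plugins instead):

snakemake --jobs 50 --cluster "sbatch --partition=nomosix --mem=4G --time=1-0:0 --output=slurm-%j.log"

Here --jobs limits how many cluster jobs Snakemake keeps submitted at the same time.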

Job dependencies

Often we develop pipelines where a particular job must be launched only after previous jobs have successfully completed. SLURM provides a way to implement such pipelines with its --dependency option.

Let's consider a pipeline with the following steps:

  1. split input data into N chunks;
  2. submit SLURM job array with N jobs - one job per data chunk;
  3. merge N output files into single final output file.

In this example, job (3) must be launched only after all jobs in step (2) are successfully finished. You can tell SLURM to automatically run job (3) after jobs in step (2) with the following bash script:

# On successful job submission, SLURM prints the new job identifier to standard output. We can use this job identifier to specify the job dependency.

# Submit your job array.
message=$(sbatch --array=0-9 --wrap='Rscript myscript.R chunk_$SLURM_ARRAY_TASK_ID.txt')

# Extract the job identifier from SLURM's message.
if ! echo ${message} | grep -q "[1-9][0-9]*$"; then
   echo "Job(s) submission failed."
   echo ${message}
   exit 1
else
   job_id=$(echo ${message} | grep -oh "[1-9][0-9]*$")
fi

# Submit the merge script, which will be launched only after all jobs in the previously submitted job array have successfully completed.
sbatch --depend=afterok:${job_id} --wrap="Rscript mymergescript.R"

:frog: If a dependency condition is never satisfied, the dependent job will remain in the SLURM queue with status DependencyNeverSatisfied. In this case, you need to cancel such a job manually with the scancel command.

Monitoring jobs

The two most important commands for monitoring your job status are squeue and scontrol show job.
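For example (replace <username> and <job id> with your own):

# List all of your queued and running jobs.
squeue --user=<username>

# Show detailed information (state, resources, reason for pending) about a single job.
scontrol show job <job id>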

Reviewing completed jobs

If you are curious how long a completed job ran for or how much memory it used, you can look up that information with sacct. For example:
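The exact list of fields is up to you; the ones below are commonly useful standard sacct format fields:

sacct -j <job id> --format=JobID,JobName,Elapsed,ReqMem,MaxRSS,State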

For other available options, see the sacct documentation.

Canceling jobs

To cancel a job, use scancel:
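For example:

# Cancel a single job (or an entire job array) by its job id.
scancel <job id>

# Cancel all of your jobs at once.
scancel --user=<username>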

FAQ

  1. My job is close to its time limit. How can I extend it?

    Run scontrol update jobid=<job id> TimeLimit=<days>-<hours>:<minutes>. The new time limit must be greater than the current one! Otherwise SLURM will cancel your job immediately. If you don't have permission to run this command, contact the administrator (in that case, please do so at least one day before the time limit expires).

  2. What should I do if my job is pending (PD) with a (job requeued in held state) or (JobHeldUser) message?

    Run scontrol release <job id>.

  3. I want to run some of my jobs before the others.

    You can achieve this by increasing the nice value of your less important jobs using scontrol update jobid=<job id> nice=<value>. The default nice value is 0; jobs with a higher nice value have lower priority and are executed after jobs with a lower nice value.

  4. What are these terms?

    node = machine = computer

    number of threads ≈ number of cores ≈ number of cpus

  5. Where are the docs?

    here