mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

max.concurrent.jobs issue with SLURM arrays #198

Open jgrn307 opened 6 years ago

jgrn307 commented 6 years ago

Hi there, this should be an easy fix. When max.concurrent.jobs=N is set in the resources and the jobs are run as a SLURM array job, the limit does not work as expected: if you lump, say, 1000 array tasks into one job, only a single job is submitted, so in practice you can get up to 1000 concurrent "jobs" (tasks in SLURM-speak, but really the same thing).

I'd like to suggest that when batchtools submits a SLURM array job, max.concurrent.jobs should also set the array's concurrency limit, as described in: https://slurm.schedmd.com/job_array.html

All you'd need to do is pass that resource variable to the line (in your .tmpl): <%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

to be something like: <%= if (array.jobs) sprintf("#SBATCH --array=1-%i%%%i", nrow(jobs), resources$max.concurrent.jobs) else "" %>

where resources$max.concurrent.jobs is the number of concurrent tasks passed in from the batchtools resources (assuming it is exposed to the template under that name). Note that SLURM expects a literal "%" between the range and the limit, so it has to be written as "%%" inside sprintf().
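
For anyone unsure about the literal percent sign: a minimal check in plain R, independent of batchtools, of how sprintf() produces the `%` separator that SLURM's --array option uses. The values 1000 and 50 are made up for illustration.

```r
# sprintf() treats "%" as the start of a format spec, so a literal
# percent sign must be written as "%%".
n.tasks <- 1000L  # illustrative number of array tasks
limit <- 50L      # illustrative cap on simultaneously running tasks

directive <- sprintf("#SBATCH --array=1-%i%%%i", n.tasks, limit)
print(directive)
# [1] "#SBATCH --array=1-1000%50"
```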

mllg commented 6 years ago

Background on max.concurrent.jobs: This is intended for cases where you want to leave some CPUs for other users, e.g. on makeshift SSH clusters. Sometimes it is even useful on managed systems when your colleagues need to get their computations done before a deadline and ask you to back off a bit.

The suggested Slurm option controls how many jobs of a single job array may run concurrently. The batchtools option, on the other hand, controls how many jobs or job arrays of the registry may be submitted (queued or running) concurrently. For the above use case you will find the batchtools option more useful.

If you need to limit how many jobs of a single array run concurrently (e.g. because all array jobs are heavy on the file system and would block each other), you can pass this option as a resource and use it in the template.
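
A rough sketch of what "pass it as a resource and use it in the template" could look like. The resource name `array.limit` is hypothetical (a separate name is used here so that it does not also trigger the registry-level max.concurrent.jobs throttle), and the helper function only simulates, in plain R, the expression that would sit inside the `<%= ... %>` tag of the template. `array.jobs`, `jobs` and `resources` stand in for the variables that appear in the template line quoted earlier in this thread.

```r
# Simulates the template expression: builds the --array directive and
# appends "%<limit>" only when the (hypothetical) resource is set.
template_array_line <- function(array.jobs, jobs, resources) {
  if (!array.jobs) return("")
  limit <- resources$array.limit  # hypothetical resource name
  suffix <- if (is.null(limit)) "" else sprintf("%%%i", as.integer(limit))
  sprintf("#SBATCH --array=1-%i%s", nrow(jobs), suffix)
}

jobs <- data.frame(job.id = 1:1000)  # stand-in for the jobs table

with_limit <- template_array_line(TRUE, jobs, list(array.limit = 50))
without_limit <- template_array_line(TRUE, jobs, list())
print(with_limit)
print(without_limit)
```

In the template itself, the body of the function would replace the existing --array expression. On the R side the value would then be supplied along the lines of `submitJobs(resources = list(chunks.as.arrayjobs = TRUE, array.limit = 50))`, assuming the jobs are chunked so that they are submitted as an array in the first place.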