phue closed this issue 3 years ago.
Just for reference: https://bugs.schedmd.com/show_bug.cgi?id=9036
The mentioned scancel problem affects all slurm versions before 20.02.4.
@pditommaso Could I ask for your opinion on this?
Sorry, I missed this. Following #807, the solution was to use the SLURM_CLUSTERS variable, as reported in the docs: https://www.nextflow.io/docs/latest/executor.html?highlight=slurm_clusters#slurm
Would that work for you?
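For reference, that means exporting the variable in the environment from which the nextflow head job is launched, so that the sbatch/squeue/scancel commands issued by nextflow inherit it. A minimal sketch, assuming the sub-cluster is called foo and the pipeline entry script is main.nf (both placeholders):

```bash
# set in the shell (or batch script) that launches nextflow
export SLURM_CLUSTERS=foo
nextflow run main.nf
```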
No worries :)
Unfortunately, the scancel command in older slurm versions ignores the SLURM_CLUSTERS variable, while sbatch and squeue work fine.
The current setup is that a single head node is used for multiple slurm clusters, and in order to kill running jobs, scancel needs to be invoked with the --clusters flag.
This seems like a very installation-specific problem. What about creating a wrapper script for scancel that adds that option if it is missing?
I was also hoping to work around this by adding a simple bash alias to my .bashrc, such as

alias scancel='scancel --clusters foo'

But nextflow didn't seem to use that alias; I haven't taken a closer look yet as to why not.
An alias won't work; it should be a scancel script reachable in the PATH, and this script should use an absolute path to the real scancel.
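A minimal sketch of such a wrapper, assuming the real binary lives at /usr/bin/scancel and the target sub-cluster is called foo (both are placeholders for the actual installation):

```bash
#!/usr/bin/env bash
# Place this script earlier in the PATH than the real scancel, e.g. ~/bin/scancel.
# It forwards all arguments to the real binary (absolute path, so the wrapper
# never calls itself) and adds --clusters unless the caller already passed it.

REAL_SCANCEL=/usr/bin/scancel   # adjust to the actual location
CLUSTER=foo                     # sub-cluster to target

case " $* " in
  *" -M"* | *" --clusters"*) exec "$REAL_SCANCEL" "$@" ;;
  *)                          exec "$REAL_SCANCEL" --clusters "$CLUSTER" "$@" ;;
esac
```

With something like this in place, the plain scancel calls issued by nextflow would transparently target the right sub-cluster.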
New feature
The slurm scheduler can be configured as a federated cluster, consisting of several "sub-clusters". To interact with the scheduler and submit to specific clusters, one has to supply the cluster ID using the -M/--clusters flag, or set the SLURM_CLUSTERS environment variable. See also https://github.com/nextflow-io/nextflow/issues/807

However, it seems that depending on the slurm version, not all slurm commands take this environment variable into account. For example, with slurm v18.08.9 the scancel command fails to kill jobs on sub-cluster foo if not supplied with the -M foo flag.

This can lead to unexpected behaviour with nextflow, because the nextflow process cannot cancel running jobs with the current implementation.
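To make the asymmetry concrete, here is a minimal sketch of the commands involved (the sub-cluster name foo comes from the example above; the job script and job id are purely illustrative):

```bash
export SLURM_CLUSTERS=foo

sbatch job.sh      # honours SLURM_CLUSTERS: the job is routed to sub-cluster foo
squeue             # honours SLURM_CLUSTERS: the job shows up in the queue

# On affected releases (e.g. slurm v18.08.9) scancel ignores the variable,
# so the job can only be cancelled by passing the cluster explicitly:
scancel 12345                   # job not found on the default cluster
scancel --clusters foo 12345    # cancels the job
```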
Usage scenario
I think this could be considered a slurm bug in older versions; however, implementing this as a new directive would also add flexibility, because it would open up the possibility of submitting to different sub-clusters within a nextflow pipeline.
Suggested implementation
https://github.com/nextflow-io/nextflow/blob/d8ae94bf7237d482db95993c9e31182f6f1a23d7/modules/nextflow/src/main/groovy/nextflow/executor/SlurmExecutor.groovy#L68-L76
I would suggest adding something like this, and then passing it on to getSubmitCommandLine, getKillCommand and queueStatusCommand.
Alternative implementation
Another possibility would be to use the existing
clusterOptions
directive for that purpose, which then would also need to be passed to above functions.