snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Unsuccessful attempt to use "srun" after job submission leads to abort of jobs #91

Closed: Robin1em closed this issue 6 months ago

Robin1em commented 6 months ago

When I try to run a Snakefile with SLURM, the jobs get submitted but stop after 3-4 seconds without creating the desired output.

The snakefile has the following content:

rule all:
        input: "1.txt", "2.txt", "3.txt"   

rule create:
        output: "{wildcard}.txt"
        shell: "sleep 60 > {wildcards.wildcard}.log 2> {wildcards.wildcard}.err && touch {wildcards.wildcard}.txt"

I run it with this command:

snakemake --cores 2  -p --executor slurm --jobs 10 --default-resources mem_mb=1000 runtime=10 cpus_per_task=2

The log looks as follows, and it seems like Snakemake wants to use "srun" after submitting the jobs but cannot find the command, although it is there.

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954
Select jobs to execute...
Execute 1 jobs...

[Fri May 10 14:12:03 2024]
rule create:
    output: 1.txt
    jobid: 0
    reason: Forced execution
    wildcards: wildcard=1
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, runtime=10

/bin/sh: 1: srun: not found
[Fri May 10 14:12:04 2024]
Error in rule create:
    jobid: 0
    output: 1.txt

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
WorkflowError:
At least one job did not complete successfully.
cmeesters commented 6 months ago

Hi,

srun is not found? Usually, you should see that the jobs get submitted. Do you see this?

Does sbatch hello_world.sh work with hello_world.sh being:

#!/bin/bash

srun echo "Hello World from $(hostname)"

Usually, there would be directives like #SBATCH -p <partition>, but I assume your cluster has a default account and a default partition, since these are omitted from the command line you showed.
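
If that script fails in the same way, a quick check of what the compute node actually sees can help narrow things down. A minimal sketch, assuming your site permits sbatch --wrap and that default partition/account apply (otherwise add the corresponding options):

# check PATH and srun availability on whichever compute node the job lands on
sbatch --wrap 'echo "running on $(hostname)"; echo "PATH=$PATH"; command -v srun || echo "srun not on PATH"'

If command -v srun comes back empty in the resulting job log, the SLURM client commands are simply not installed (or not on the PATH) on the compute nodes.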

iimog commented 6 months ago

Hi, I'm working with @Robin1em on this project. Thank you for your quick response. Indeed, the jobs get submitted (squeue shows them) but fail after a couple of seconds; the "srun: not found" error appears in the individual log files.

Our cluster setup consists of a single node that we connect to in order to submit our jobs with sbatch or srun. The jobs are then executed on one of several compute nodes. Submitting jobs from the compute nodes is not supported.

Thus, when running your hello_world.sh via sbatch, the SLURM log shows:

/var/spool/slurmd/job42513/slurm_script: line 3: srun: command not found

When we omit the srun in the script, everything works as expected and the hostname of the compute node is printed. Is it necessary to use srun on the compute nodes?

PS: Yes, the cluster has a default account and partition, so we can submit without specifying these.

cmeesters commented 6 months ago

Hi,

Is it necessary to use srun on the compute nodes?

As
a) srun is the default MPI starter under SLURM,
b) recent versions of SLURM confine cgroups to a default of 1 core (see issue #41),
c) it helps to differentiate error sources (the SLURM script itself vs. a job step), and
d) SLURM supports multiple job steps across nodes (which we will hopefully extend to new features in the future),
I dare say a cluster without a SLURM step daemon to register with (hence the need for srun on the compute nodes) is not a functional SLURM cluster.
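
To make c) and d) a bit more concrete, here is a minimal sketch of a batch script with two job steps (task_a and task_b are placeholders for your own commands), plus the sacct call to see which part failed:

#!/bin/bash
#SBATCH --ntasks=2

srun --ntasks=1 ./task_a    # job step 0 (placeholder command)
srun --ntasks=1 ./task_b    # job step 1 (placeholder command)

# afterwards, accounting lists the batch script and each step separately
# (JobID, JobID.batch, JobID.0, JobID.1, ...), so you can tell whether the
# script itself or a particular step failed:
# sacct -j <jobid> --format=JobID,JobName,State,ExitCode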

Besides, I fear it is not feasible to program exceptions of this kind (testing for the availability of srun on the compute nodes) into the jobstep executor plugin for all kinds of setups. You could, however, patch the jobstep executor and omit the srun lines. Note that this might break the portability of your workflows in the future if you choose to do so.

Perhaps you can approach your admins about this?

iimog commented 6 months ago

Thank you very much for the explanation. I have informed our admin; hopefully this can be fixed on the cluster configuration side. Otherwise I'll consider patching and pinning the plugin, although I consider that solution suboptimal as well. In any case, thank you for this plugin and the great support.
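
If we do end up patching, the rough idea would be to install the patched fork into the project environment and pin it there, roughly like this (the fork URL and tag are placeholders):

# hypothetical: install and pin a patched fork of the jobstep plugin
pip install "snakemake-executor-plugin-slurm-jobstep @ git+https://github.com/<our-fork>/snakemake-executor-plugin-slurm-jobstep@<patched-tag>"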

cmeesters commented 6 months ago

You are most welcome - hopefully a next issue can be resolved on our side.