Hi,
srun is not found? Usually, you should see that jobs get submitted; do you see this?
Does sbatch hello_world.sh work, with hello_world.sh being:
#!/bin/bash
srun echo "Hello World from $(hostname)"
Usually, there should be pragmas like #SBATCH -p <partition>, but I assume your cluster has a default account and a default partition, because they are omitted in the command line you showed.
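For reference, a more complete batch script with explicit pragmas might look like the following; partition, account, and time limit are placeholders, not values from your cluster:

#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH -p <partition>   # placeholder partition name
#SBATCH -A <account>     # placeholder account; omit if your cluster provides a default
#SBATCH --time=00:05:00
#SBATCH --ntasks=1

srun echo "Hello World from $(hostname)"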
Hi, I'm working with @Robin1em on this project; thank you for your quick response. Indeed, the jobs get submitted (squeue shows them) but fail after a couple of seconds. The "srun: command not found" error is in the individual log files.
Our cluster consists of a single login node, which we connect to in order to submit our jobs with sbatch or srun. The job is then executed on one of several compute nodes. Submitting jobs from the compute nodes is not supported.
Thus, when running your hello_world.sh via sbatch, the SLURM log shows:
/var/spool/slurmd/job42513/slurm_script: line 3: srun: command not found
When omitting the srun call in the script, everything works as expected and the hostname of the compute node is printed. Is it necessary to use srun on the compute nodes?
PS: Yes, the cluster has a default account and partition, so we can submit without specifying these.
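In case it helps anyone debugging something similar, a minimal sketch (not the exact command we ran) to check whether srun is on the PATH of a compute node is to submit a one-line job and inspect the resulting slurm-<jobid>.out file:

sbatch --wrap 'command -v srun || echo "srun not on PATH"'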
Hi,
Is it necessary to use srun on the compute nodes?
As srun is (a) the default MPI starter under SLURM, (b) needed because recent versions of SLURM confine cgroups to a default of 1 core (see issue #41), (c) helpful for differentiating error sources (the SLURM script itself versus an individual job step), and (d) the mechanism by which SLURM supports multiple job steps across nodes (which we hopefully will extend to new features in the future), I dare say a cluster without a SLURM step daemon to register job steps with (and hence without srun on the compute nodes) is not a functional SLURM cluster.
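To illustrate the job-step point with a generic example (not the script the plugin generates): every srun invocation inside a batch script is registered as its own job step and shows up separately in the accounting, which makes it much easier to see which part of a job failed:

#!/bin/bash
#SBATCH --ntasks=2

srun --ntasks=1 hostname    # job step 0
srun --ntasks=1 sleep 30    # job step 1

# afterwards, e.g.  sacct -j <jobid> --format=JobID,JobName,State
# lists <jobid>.0 and <jobid>.1 as separate steps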
Besides, I fear it is not feasible to program exceptions of this kind (testing for the availability of srun on the compute nodes) into the jobstep-executor plugin for all kinds of setups. You could, however, patch the jobstep-executor and omit the srun lines. Note that this might break the portability of your workflows in the future if you choose to do so.
Perhaps you can approach your admins about this?
Thank you very much for the explanation. I have informed our admin; hopefully this can be fixed on the cluster configuration side. Otherwise, I will consider patching and pinning the plugin, though I also consider that solution suboptimal. In any case, thank you for this plugin and the great support.
You are most welcome - hopefully a next issue can be resolved on our side.
When I try to run a Snakefile with SLURM, the jobs get submitted but stop after 3-4 seconds without creating the desired output.
The Snakefile has the following content:
I run it with this command:
The log looks as follows; it seems like Snakemake wants to use srun after submitting the jobs but cannot find the command, although it is there.
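(The Snakefile content and the exact command line are omitted here. As a rough sketch, assuming Snakemake >= 8 with the SLURM executor plugin installed, the invocation is along the lines of:)

snakemake --executor slurm --jobs 10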