Closed paudano closed 5 months ago
Thank you for bringing this issue to our attention! This is a serious change in SchedMD's interface. I did not notice this because we are just in the process of setting up our new cluster and are so far a version or two behind the release schedule on our current one.
Even if I override this and tell a command to use a hard-coded number of threads (not relying on the rule's `threads` value), the job still only uses one core.
This is logical: Slurm sets up a cgroup limited to one core, and all threads are confined to that cgroup.
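The confinement is easy to observe from inside a job step. A minimal sketch (assuming a Linux node; `SLURM_CPUS_PER_TASK` is only set inside a Slurm allocation):

```shell
# Compare what Slurm allocated with what the cgroup/affinity mask actually
# lets this process use. `nproc` honors sched_getaffinity, so inside a
# one-core cgroup it prints 1 even on a 64-core node.
echo "allocated by Slurm: ${SLURM_CPUS_PER_TASK:-unset (not in a Slurm job)}"
echo "usable (affinity):  $(nproc)"
echo "physical on node:   $(nproc --all)"
```

Outside a Slurm job the first line prints `unset` and the other two match; inside an affected job, the usable count stays at 1 regardless of `--cpus-per-task`.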
I will attempt an urgent fix!
Thanks Christian! If it helps, I'll try to set up a test environment on my end to verify a fix before it's released. Yes, the cgroup constraint makes sense.
Actually, two fixes are needed: one in the jobstep executor (I am busy with that one, but the test cases are bugging me) and one in this executor.
@paudano Please install the newest releases of `snakemake-executor-plugin-slurm` and `snakemake-executor-plugin-slurm-jobstep`. They will be available on Bioconda shortly.
Thank you! I was able to verify on our Slurm system (23.02.7).
I confirmed that jobs are getting assigned multiple CPUs when requested, and that processes inside the job were able to use multiple CPUs (pigz uses ~400% CPU in top with 4 threads).
snakemake 8.10.8 hdfd78af_0 bioconda
snakemake-executor-plugin-slurm 0.4.5 pyhdfd78af_0 bioconda
snakemake-executor-plugin-slurm-jobstep 0.2.1 pyhdfd78af_0 bioconda
snakemake-interface-common 1.17.2 pyhdfd78af_0 bioconda
snakemake-interface-executor-plugins 9.1.1 pyhdfd78af_0 bioconda
snakemake-interface-report-plugins 1.0.0 pyhdfd78af_0 bioconda
snakemake-interface-storage-plugins 3.2.2 pyhdfd78af_0 bioconda
snakemake-minimal 8.10.8 pyhdfd78af_0 bioconda
python 3.11.9 hb806964_0_cpython conda-forge
Huh? For optimal functionality, you need to update to Python >= 3.12. I am surprised you do not see further issues with 3.11.
Anyway, thanks for the feedback!
Thanks for the heads-up. I was having some compatibility issues with other packages and 3.12 a couple of months ago, but it's probably worth trying again.
On our cluster, Slurm man pages say:
When I submit multi-core jobs through Snakemake, it looks like they get all the requested cores (squeue), but they behave as if they have a single core. In the job, `threads` is set to 1. Even if I override this and tell a command to use a hard-coded number of threads (not relying on the rule's `threads` value), the job still only uses one core. I think this might be the cause of Snakemake issue #2447.
I created a test case where I'm piping /dev/random through pigz and capturing the top output (Snakemake file and profiles directory): Snakefile.zip profiles.zip
When this command runs, I get (long strings redacted with XXX):
sbatch call: sbatch --job-name XXX --output XXX/%j.log --export=ALL --comment rule_a -A XXX -p XXX -t 12 --mem 512 --cpus-per-task=4 -D XXX --wrap="XXX/python3.11 -m snakemake --snakefile 'XXX/Snakefile' --target-jobs 'rule_a:' --allowed-rules 'rule_a' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=512' 'mem_mib=489' 'disk_mb=1000' 'disk_mib=954' --wait-for-files 'XXX/.snakemake/tmp.uq8q7p21' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers code mtime input params software-env --conda-frontend 'mamba' --shared-fs-usage input-output sources software-deployment source-cache persistence storage-local-copies --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --set-threads 'rule_a=4' --latency-wait 5 --scheduler 'greedy' --scheduler-solver-path 'XXX/bin' --default-resources 'mem_mb=512' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' 'runtime=12' --executor slurm-jobstep --jobs 1 --mode 'remote'"
From discussions I've had with our cluster admins, it sounds like Snakemake should be running `srun` inside the `--wrap` argument (something like `--wrap="srun SRUN_PARAMS ... XXX/python3.11 -m snakemake ..."`). Thank you!
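For illustration only, the suggested shape could be sketched as a dry run like the one below. The CPU count and the inner command are placeholders, not what the plugin actually emits:

```shell
# Build -- but do not submit -- an sbatch call whose --wrap payload runs the
# inner snakemake process under srun, so the job step inherits the full CPU
# allocation instead of being pinned to a one-core cgroup. All values here
# are hypothetical placeholders.
CPUS=4
INNER="python3 -m snakemake --snakefile Snakefile --cores all"
CMD="sbatch --cpus-per-task=${CPUS} --wrap=\"srun --cpus-per-task=${CPUS} ${INNER}\""
echo "$CMD"
```

Passing `--cpus-per-task` again on the `srun` line matters because, since Slurm 22.05, job steps no longer inherit that option from `sbatch` automatically.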