snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster

threads directive of a rule not taken into account #145

Open ArthurDondi opened 2 months ago

ArthurDondi commented 2 months ago

Software Versions

snakemake 8.18.2
snakemake-executor-plugin-slurm 0.10.0
snakemake-executor-plugin-slurm-jobstep 0.2.1
slurm 23.02.7

Describe the bug

The number of threads specified (64) for the rule BaseCellCounter_scDNACalling is not respected. Instead, 4 threads are always provided, even if I request only 1 thread; no idea why 4 in particular. It is similar to #141, but I'm submitting from the head node rather than through a bash script, so I opened a new issue. I tried threads, cpus-per-task, and both, inside the rule and in the profile... Nothing works.

Here is my profile, and you can find the logs below:

executor: slurm
latency-wait: 60
jobs: 500

default-resources:
  cpus_per_task: 64
  mem_mb_per_cpu: 1024
  slurm_account: "'es_beere'"
  tmpdir: "'/scratch'"
  time: 1200
  runtime: 1200
  cores: 400

set-threads:
  BaseCellCounter_scDNACalling: 64
set-resources:
  BaseCellCounter_scDNACalling:
    cpus_per_task: 64

Logs

snakemake -s snakefiles/scDNACalling.smk --configfile config/config_OvCa_LR.yml --profile profile_simple/ --use-conda --latency-wait 30 --show-failed-logs -p --rerun-triggers mtime
Using profile profile_simple/ for setting default command line arguments.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
SLURM run ID: 9523c35c-d555-4056-837c-676237a3340f
Using shell: /usr/bin/bash
Provided remote nodes: 500
Job stats:
job                                   count
----------------------------------  -------
BaseCellCalling_step1_scDNACalling        3
BaseCellCalling_step2_scDNACalling        3
BaseCellCalling_step3_scDNACalling        3
BaseCellCounter_scDNACalling              6
MergeCounts_scDNACalling                  3
all_scDNACalling                          1
total                                    19

Select jobs to execute...
Execute 6 jobs...

[Sat Sep  7 22:40:10 2024]
rule BaseCellCounter_scDNACalling:
    input: /cluster/work/bewi/members/dondia/projects/long_reads_tree/LongSom_out/OvCa_LR/scDNACalling/SplitBam/B486.Clone_Tum.bam, /cluster/work/bewi/members/dondia/projects/long_reads_tree/LongSom_out/OvCa_LR/scDNACalling/BaseCellCounter/B486.scRNASites.bed
    output: /cluster/work/bewi/members/dondia/projects/long_reads_tree/LongSom_out/OvCa_LR/scDNACalling/BaseCellCounter/B486/B486.Clone_Tum.tsv, /cluster/work/bewi/members/dondia/projects/long_reads_tree/LongSom_out/OvCa_LR/scDNACalling/BaseCellCounter/B486/temp_Clone_Tum
    jobid: 5
    reason: Missing output files: /cluster/work/bewi/members/dondia/projects/long_reads_tree/LongSom_out/OvCa_LR/scDNACalling/BaseCellCounter/B486/B486.Clone_Tum.tsv
    wildcards: scDNA=B486, clone=Clone_Tum
    threads: 4
    resources: mem_mb=8000, mem_mib=7630, disk_mb=20703, disk_mib=19744, tmpdir=<TBD>, cpus_per_task=64, mem_mb_per_cpu=1024, slurm_account=es_beere, time=1200, runtime=1200, cores=400

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=8000, mem_mib=7630, disk_mb=20703, disk_mib=19744, cpus_per_task=64, mem_mb_per_cpu=1024, time=1200, cores=400
Select jobs to execute...
Execute 1 jobs...

Minimal example

rule all:
    input:
        'out.txt'

rule BaseCellCounter_scDNACalling:
    input:
        'in.txt'
    output:
        'out.txt'
    threads: 64
    shell:
        "touch out.txt"

will give:

snakemake -s snakefiles/test.smk --profile profile_simple/ --use-conda --latency-wait 30 --show-failed-logs -p --rerun-triggers mtime
Using profile profile_simple/ for setting default command line arguments.
Building DAG of jobs...
SLURM run ID: fbb1212b-d274-4dc9-af61-9c2b44c5d9d7
Using shell: /usr/bin/bash
Provided remote nodes: 500
Job stats:
job                             count
----------------------------  -------
BaseCellCounter_scDNACalling        1
all                                 1
total                               2

Select jobs to execute...
Execute 1 jobs...

[Sat Sep  7 23:28:04 2024]
rule BaseCellCounter_scDNACalling:
    input: in.txt
    output: out.txt
    jobid: 1
    reason: Missing output files: out.txt
    threads: 4
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, cpus_per_task=64, mem_mb_per_cpu=1024, slurm_account=es_beere, time=1200, runtime=1200

Additional context

Not related, but any idea why tmpdir is <TBD> even though I specify it in the profile?

cjops commented 2 months ago

I'm not an author of this plugin, but I might be able to help. I think cores: 400 underneath default-resources: isn't doing what you think it's doing. cores isn't one of the Snakemake standard resources, nor is it a resource recognized by this plugin. Instead, you probably want to move cores: 400 to the global scope of the profile, e.g. next to jobs: 500. There it sets a global maximum on the number of cores to request at any given time from SLURM (behaving like the Snakemake CLI flag --cores).
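For illustration, here is a sketch of the profile from above with cores moved to the global scope (everything else unchanged):

executor: slurm
latency-wait: 60
jobs: 500
cores: 400

default-resources:
  cpus_per_task: 64
  mem_mb_per_cpu: 1024
  slurm_account: "'es_beere'"
  tmpdir: "'/scratch'"
  time: 1200
  runtime: 1200

set-threads:
  BaseCellCounter_scDNACalling: 64
set-resources:
  BaseCellCounter_scDNACalling:
    cpus_per_task: 64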

However, this brings up the annoyance that led me to search this issue tracker in the first place. If --cores is left unspecified, then even if --jobs is set to a very high number or unlimited, it defaults to the number of available cores on the head node. If you run nproc on your head node, I bet you will get 4; that's where this seemingly arbitrary number comes from.
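A quick way to check (the output shown is hypothetical, assuming a 4-core head node):

$ nproc
4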

I suppose this behavior is in line with a very close reading of the Snakemake CLI documentation; however, it was not the behavior prior to Snakemake v8.0.0, when --slurm was spun off into a plugin. Previously you could specify --jobs without --cores, and there would be no restriction on the total number of cores requested across the cluster. I'm not sure how to get this behavior back, or whether I should raise it as an issue here or on the main Snakemake repository; either way, the maintainers seem overwhelmed at the moment.

ArthurDondi commented 2 months ago

Spot on, thanks a lot! It was indeed the number of available cores on the head node, and moving cores: outside of default-resources: did the trick.

I also saw that the maintainers are overwhelmed. I'll leave this open in case other people stumble upon this too, but feel free to close it.

raphaelbetschart commented 1 month ago

Are you sure that cores sets the global maximum number of cores at any given time? For instance, if I have this Snakefile

CHRS = ["chr{}".format(x) for x in range(1, 23)]

rule all:
    input:
        expand("resources/test_{chr}.txt", chr=CHRS)

rule TestSLURM:
    output:
        "resources/test_{chr}.txt"
    threads:
        64
    shell:
        """
        echo ${{SLURM_CPUS_PER_TASK}} > {output}
        sleep 60
        """

with this profile

default-resources:
  slurm_partition: "nodes"
  mem_mb: 4000
  runtime: 60
cores: 256
restart-times: 0
max-jobs-per-second: 1
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 5
jobs: 1000000
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
executor: slurm

it saturates all available nodes (using more than 256 cores in total). I would expect that at most four TestSLURM jobs (256 cores / 64 threads = 4) can run at the same time.

freekvh commented 1 week ago

I actually do use threads, and it is only taken into account when set via set-threads (in my config.yaml):

set-threads:
  salmon: 16

When I use set-resources as below:

set-resources:
  salmon:
    threads: 16
    mem_mb: 20000 # This seems to work, but can we do with less?
    runtime: 600

threads is ignored and the default from the rule is used.
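For reference, the combination that does work for me is keeping threads under set-threads and the remaining resources under set-resources (a sketch assembled from the two snippets above):

set-threads:
  salmon: 16

set-resources:
  salmon:
    mem_mb: 20000
    runtime: 600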

cmeesters commented 1 week ago

@freekvh this is intended behaviour according to the docs; scroll down a bit. It is not related to the issues mentioned in this thread. The reason behind this redundant definition is that the threads parameter can be picked up in the set-resources section to dynamically alter other settings.
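For example, something along these lines lets a resource scale with the rule's thread count (a sketch; it assumes resource values in set-resources may be given as Python expressions with access to threads, as Snakemake documents for default-resources, and the rule name and numbers are illustrative):

set-threads:
  salmon: 16

set-resources:
  salmon:
    mem_mb: 1024 * threads
    runtime: 600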

Everyone, please open separate issues when dealing with separate issues. ;-)