Open stefmldk opened 7 months ago

Dear developers,

I have a rule that produces unexpected results when submitting to SLURM (Snakemake version: 8.10.0). So, I turned on verbose output in order to double-check that the command was formatted correctly. According to Snakemake, the following command was submitted to SLURM:

Then I opened the referenced 633204.log, and here the command looked like:

Notice how the --scatter-count option has changed from 32 to 1, which explains the unexpected results. This number is based on the available cores and is calculated like this in my workflow:

large_job_cores = max(1, math.floor(workflow.cores / len(SAMPLES)))

Since this number has apparently changed between the main workflow and the SLURM submission, I wonder whether Snakemake is run a second time for each SLURM submission? If so, workflow.cores (and maybe other parameters) should be transferred somehow from the main workflow to the SLURM submission.

Thank you for all your hard work!

Cheers!
Hi,

Thanks for the feedback!

The SLURM executor plugin is oblivious to whatever you write in your Snakefile, except for threads and cpus_per_task (the latter usually defined as a resource in a workflow profile). So, how is large_job_cores propagated further? Please show us that code, so we can work out whether we are indeed dealing with a bug. Thank you!
Thanks, certainly.
large_job_cores is used to control multithreading in rules that launch multi-threaded programs. In addition, parallel computing is also achieved for certain single-threaded jobs by splitting input files into chunks before running them. The latter is what SplitIntervals is for: it creates as many chunks as you instruct it to via the --scatter-count option.
I don't know if it will help to shed some light on this, but here is how large_job_cores is used in that rule:
large_job_cores = max(1, math.floor(workflow.cores / len(SAMPLES)))
INTERVAL_SUBDIVISION_NUMBERS = [str(i).zfill(4) for i in range(large_job_cores)]
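(For illustration with numbers consistent with the rest of this thread: if workflow.cores evaluates to 256 and there are 8 samples, then large_job_cores = max(1, floor(256 / 8)) = 32, so 32 interval chunks and 32 expected output files.)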
rule subdivide_intervals_file:
    input:
        intervals_file = rules.create_intervals_file.output.intervals_file,
        ref = rules.prepare_genome_files.output.fasta_ref,
        ref_index = rules.prepare_genome_files.output.ref_index,
        ref_dict = rules.create_reference_dict.output.ref_dict
    output:
        scattered_interval_lists = expand(os.path.join(intervals_folder, '{subdivision}scattered.interval_list'), subdivision=INTERVAL_SUBDIVISION_NUMBERS)
    conda:
        conda_environment
    params:
        intervals_folder = intervals_folder
    log:
        os.path.join(logs, 'subdivide_intervals_file.log')
    shell:
        """
        gatk SplitIntervals \
            -R {input.ref} \
            -L {input.intervals_file} \
            --scatter-count {large_job_cores} \
            --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
            -O {params.intervals_folder} \
            > {log} 2>&1
        """
Since large_job_cores has a direct consequence for the expected output files of this rule, Snakemake throws a missing-files error. That is, on a workflow run where large_job_cores evaluates to 32, Snakemake says it submits the correct job command, but as the SLURM log shows, this number has been changed to 1. So Snakemake expects 32 files, but only one gets produced. For the rules that run multithreaded programs, I expect that they will also run with just one core instead of the expected 32. However, no missing-files error arises in those cases.

In any case, I expect that a difference between what Snakemake says it sends to SLURM and what is actually sent is not something one would want. Thank you for looking into this.

Cheers.
OK, thank you. This is no bug in the executor, but rather one in your Snakefile. Let me explain:

You instruct your executable to use a number of resources, yet upon submission nothing will be reserved. I suggest changing to either
rule subdivide_intervals_file:
    <snip>
    resources:
        cpus_per_task=large_job_cores
    <snip>
or
rule subdivide_intervals_file:
    <snip>
    threads: large_job_cores
    <snip>
However, as large_job_cores presumably gets calculated on a login node, you might want to consider that your assumption does not hold when the number of cores there differs from that on the compute nodes. To ensure portability of your workflow, you had better perform the calculation within the rule, e.g.:
rule subdivide_intervals_file:
    <snip>
    threads: max(1, math.floor(workflow.cores / len(SAMPLES)))
    shell: "gatk ... --scatter-count {threads} ..."
I just hope I did not make a mistake, for obviously my advice is untested.

Cheers,
Christian
Thank you, Christian.
I am not sure I agree, though. I believe the error originates (or is exemplified) in the property workflow.cores, which appears to be evaluated twice when running the workflow with the SLURM executor:
In the context of my data, the first evaluation seems to happen when Snakemake builds the DAG and figures out the names of input and output files. Here it is evaluated to the expected 256, as evident from the verbose output and the list of expected output files from subdivide_intervals_file. This rule, by the way, always runs single-threaded, but the filenames it is supposed to output depend on the evaluation of workflow.cores, which is reflected in large_job_cores.

The second evaluation seems to happen when the job is submitted to SLURM, where workflow.cores appears to take on a lower value. It is possible that it gets calculated on a login node, as you suggest, but if that is indeed the case, isn't that a problem?

If this is the expected behavior, then it means that workflow.cores cannot be used when SLURM is the executor. I hope this is not the intended behavior, but if it is, it should be clearly indicated in the documentation.
Cheers, Steffen
Hi,
"I am not sure I agree, though."
Try the following:
#!/bin/bash
#SBATCH ... # all the settings for your job submissions
#SBATCH -c1
echo "I want to run with n-threads, but am confined to ${SLURM_CPUS_PER_TASK} threads."
Whatever is asked for at submit time is available in the job context, if physically possible and not otherwise restricted by settings. When you calculate your maximum number of threads on the login node and do not use it for submission (by means of Snakemake's resources), you will not get the resource in the job context. If you recalculate in the job context, then depending on the tools you use, you will get different numbers, e.g. the available threads in the cgroup, the number of CPUs on the node, etc. This is not reliable. The workflow.cores variable will yield different values, too: Snakemake essentially submits itself. Hence the re-evaluation.
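To see the spread of possible answers for yourself, here is a small plain-Python sketch you could run inside a submitted job (only SLURM_CPUS_PER_TASK is SLURM-specific, and os.sched_getaffinity is Linux-only):

import os

# Logical CPUs physically present on the node, regardless of the allocation:
print("os.cpu_count():", os.cpu_count())

# CPUs this process may actually run on, i.e. the job's cgroup/affinity mask:
print("sched_getaffinity:", len(os.sched_getaffinity(0)))

# What was reserved at submit time (unset if cpus_per_task was not requested):
print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK"))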
Hmm, thanks, but I fail to understand how this addresses my disagreement. So let me try to be more specific about why I disagree, and hopefully you will take another fresh look at this. Thank you for taking the time.
I don't agree that there is a bug in my Snakefile. Why? Because it works and runs as expected if I do not reference workflow.cores. Conclusion: workflow.cores is the culprit here, and we can sweep everything else off the table and focus on that.
And just to be clear: workflow.cores is mentioned in several places in the main Snakemake documentation as a resource that is available to the user.
Another important remark is this: workflow.cores does NOT represent an evaluation of the number of cores available in the compute environment. Instead, it reflects the number requested when Snakemake is called. That is, if I call Snakemake like this on a single-core machine:

snakemake -s snakefile.smk -c1000

the value of workflow.cores is not 1 but 1000.
So, speculations about the influence of the login node can also be set aside.

The question now is this: why does workflow.cores get redefined when SLURM is used as the executor? It shouldn't get redefined.
I don't have any experience with the Snakemake codebase, and I am not familiar with dataclasses, but a quick comparison between the code for snakemake-executor-plugin-slurm and the skeleton for executor plugins reveals that the workflow seems to get referenced in the skeleton's __post_init__ function of the Executor, but not in snakemake-executor-plugin-slurm.

Thus, I may be completely wrong, but based on this, my best guess is that workflow.cores ends up undefined and therefore a default value of 1 is used.
Cheers.
Sorry, I should have been more elaborate:

The idea is to trigger a workflow run on the login node. This is the first invocation of Snakemake, and workflow.cores is evaluated to a number. Some jobs, those not defined as localrules, get submitted. This is where the jobstep executor plugin kicks in: it is a second invocation of Snakemake per (group) job, and workflow.cores is re-evaluated.
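One way to observe both invocations, as a minimal sketch (it assumes nothing beyond a standard Snakefile, where the workflow object is in scope):

import sys

# Runs at parse time of every Snakemake invocation: once on the login node,
# and once more inside each submitted job, where the value may be lower.
print(f"workflow.cores is currently: {workflow.cores}", file=sys.stderr)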
Now, cluster admins might treat this differently, but the default is that SLURM sets its cgroup according to the values present at submit time (we will not consider a case where a job is rejected due to faulty parameters). Basically, there are two scenarios:

1. cpus_per_task is not set at all, so SLURM falls back to its default of a single CPU per task.
2. cpus_per_task is set, but lower than the value your calculation assumed (if it is big enough, which is equivalent to setting threads, any setting is fine).

Hence, the apparent(!) core count based on workflow.cores will be lowered. In both cases, we suffer from a deviation in the apparent cores. As your count for workflow.cores is 1 in the job context, I presumed you suffer from scenario one.
This is why, upon workflow design, we need to ensure that whatever we want to use in terms of resources is actually reserved. And it is also the reason why we ought to store any resource calculation as a parameter of the rule.

But yes, this limits the use case of workflow.cores somewhat. I will put it in the documentation of this plugin!
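Applied to the rule from this thread, that advice might look like the following untested sketch (same globals as above; a params entry may be a function that receives the job's threads value):

rule subdivide_intervals_file:
    <snip>
    # Reserve what we intend to use; the SLURM executor derives
    # cpus_per_task from threads at submit time.
    threads: max(1, math.floor(workflow.cores / len(SAMPLES)))
    params:
        # Store the calculation on the rule itself, so that the command line
        # and the reservation cannot drift apart.
        scatter_count = lambda wildcards, threads: threads
    shell:
        "gatk SplitIntervals ... --scatter-count {params.scatter_count} ..."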