Define threads for Nextclade rules

joverlee521 commented 4 months ago

Prompted by https://github.com/nextstrain/ncov-ingest/issues/456#issuecomment-2218083580

From my reading of Snakemake docs on threads:

Also, setting a threads: maximum is required to achieve parallelism in tools that (often implicitly and without the user knowing) rely on an environment variable for the maximum of cores to use.

Makes me think we need to define threads of the Nextclade Snakemake rules in order for Nextclade to parallelize.

Proposing that we split the workflow cores between the two Nextclade rules:

diff --git a/workflow/snakemake_rules/nextclade.smk b/workflow/snakemake_rules/nextclade.smk
index 8d24c23..e97c8aa 100644
--- a/workflow/snakemake_rules/nextclade.smk
+++ b/workflow/snakemake_rules/nextclade.smk
@@ -193,6 +193,8 @@ rule run_wuhan_nextclade:
             temp(f"data/{database}/nextclade.translation_{gene}.upd.fasta")
             for gene in GENE_LIST
         ],
+    threads:
+        workflow.cores * 0.5
     benchmark:
         f"benchmarks/run_wuhan_nextclade_{database}.txt"
     shell:
@@ -224,6 +226,8 @@ rule run_21L_nextclade:
         sequences=f"data/{database}/nextclade_21L.sequences.fasta",
     output:
         info=f"data/{database}/nextclade_21L_new_raw.tsv",
+    threads:
+        workflow.cores * 0.5
     benchmark:
         f"benchmarks/run_21L_nextclade_{database}.txt"
     shell:

corneliusroemer commented 4 months ago

Good idea but I don't think we need to set these threads. All they do, in my experience, is limit parallelism.

Have you checked that there's a need for this? I'm pretty sure things have always been running in parallel just fine.

Nextclade uses all cores by default, unless you tell it through -j otherwise. It doesn't read env variables AFAIK

joverlee521 commented 4 months ago

Have you checked that there's a need for this? I'm pretty sure things have always been running in parallel just fine.

Hmm, I'm checking with a small example below. Note that rule a does not define threads and is defaulting to 1 core.

rule all:
    input:
        b = "b.txt",
        c = "c.txt"

rule a:
    output: touch("a.txt")
    shell:
        """
        echo rule a nproc "$(nproc)"
        """

rule b:
    input: "a.txt"
    output: touch("b.txt")
    threads: workflow.cores * 0.5
    shell:
        """
        echo rule b nproc "$(nproc)"
        """

rule c:
    input: "a.txt"
    output: touch("c.txt")
    threads: 2
    shell:
        """
        echo rule c nproc "$(nproc)"
        """

Running with nextstrain build --cpus 8

$ nextstrain build --cpus 8 snakemake/threads/ --forceall --quiet
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job      count
-----  -------
a            1
all          1
b            1
c            1
total        4

Select jobs to execute...

        echo rule a nproc "$(nproc)"

rule a nproc 1
Touching output file a.txt.
Select jobs to execute...

        echo rule b nproc "$(nproc)"

        echo rule c nproc "$(nproc)"

rule b nproc 4
Touching output file b.txt.
rule c nproc 2
Touching output file c.txt.
Select jobs to execute...
Complete log: .snakemake/log/2024-07-09T175422.806795.snakemake.log

corneliusroemer commented 4 months ago

I still don't think this is necessary. I've never used nproc and it seems to report something that nextclade doesn't care about - at least on macOS.

See how these three rules run all in parallel, even though one doesn't mention threads at all.

They all run the same time, using as many jobs as there are cores (unless limited by -j). So the threads variable has no effect other than limiting parallelness.

Though it does indeed seem to set some env variable that makes nproc report a reduced number of cores. But this is not done through cgroups, so probably has no effect on programs that don't look at those env variables.

Snakefile:

rule all:
    input:
        a="a.txt",
        c="c.txt",
        d="d.txt",

rule a:
    input:
        "input.fasta",
    output:
        touch("a.txt"),
    shell:
        """
        echo rule a nproc "$(nproc)"
        time nextclade run -d nextstrain/mpox {input} --output-all nextclade_a
        """

rule c:
    input:
        "input.fasta",
    output:
        touch("c.txt"),
    threads: 4
    shell:
        """
        echo rule c nproc "$(nproc)"
        time nextclade run -d nextstrain/mpox {input} --output-all nextclade_c
        """

rule d:
    input:
        "input.fasta",
    output:
        touch("d.txt"),
    threads: 4
    shell:
        """
        echo rule d nproc "$(nproc)"
        time nextclade run -j {threads} -d nextstrain/mpox {input} --output-all nextclade_d
        """

Output:

$ snakemake --force
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /opt/homebrew/bin/bash
Provided cores: 10
Rules claiming more threads will be scaled down.
Job stats:
job      count
-----  -------
a            1
all          1
c            1
d            1
total        4

Select jobs to execute...
Execute 3 jobs...

[Tue Jul  9 20:22:54 2024]
localrule a:
    input: input.fasta
    output: a.txt
    jobid: 1
    reason: Forced execution
    resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T

[Tue Jul  9 20:22:54 2024]
localrule d:
    input: input.fasta
    output: d.txt
    jobid: 3
    reason: Forced execution
    threads: 4
    resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T

[Tue Jul  9 20:22:54 2024]
rule a nproc 1
localrule c:
    input: input.fasta
    output: c.txt
    jobid: 2
    reason: Forced execution
    threads: 4
    resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T

rule d nproc 4
rule c nproc 4

real    0m3.644s
user    0m9.996s
sys     0m0.405s
Touching output file a.txt.
[Tue Jul  9 20:22:58 2024]
Finished job 1.
1 of 4 steps (25%) done

real    0m3.654s
user    0m9.974s
sys     0m0.423s
Touching output file c.txt.
[Tue Jul  9 20:22:58 2024]
Finished job 2.
2 of 4 steps (50%) done

real    0m4.888s
user    0m9.459s
sys     0m0.376s
Touching output file d.txt.
[Tue Jul  9 20:22:59 2024]
Finished job 3.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Jul  9 20:22:59 2024]
localrule all:
    input: a.txt, c.txt, d.txt
    jobid: 0
    reason: Forced execution
    resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T

[Tue Jul  9 20:22:59 2024]
Finished job 0.
4 of 4 steps (100%) done
Complete log: .snakemake/log/2024-07-09T202254.489461.snakemake.log

Note that the slowest is the one that gets 4 threads but limits nextclade jobs via -j 4.

This is slower than not defining threads and not passing jobs for nextclade. So the addition of jobs doesn't make anything faster, just adds lines of codes.

corneliusroemer commented 4 months ago

I outputted env for each rule, and these are the differences:

$ diff env_a.txt env_d.txt 
10c10
< NUMEXPR_NUM_THREADS=1
---
> NUMEXPR_NUM_THREADS=4
28c28
< OPENBLAS_NUM_THREADS=1
---
> OPENBLAS_NUM_THREADS=4
40,41c40,41
< VECLIB_MAXIMUM_THREADS=1
< GOTO_NUM_THREADS=1
---
> VECLIB_MAXIMUM_THREADS=4
> GOTO_NUM_THREADS=4
63c63
< OMP_NUM_THREADS=1
---
> OMP_NUM_THREADS=4
69c69
< MKL_NUM_THREADS=1
---
> MKL_NUM_THREADS=4

So it looks like nproc reports what one of these variables says, but these variables are ignored by Nextclade and hence threads only has the effect of limiting parallelity (as the default is 1, if ommitted)

joverlee521 commented 4 months ago

Hmm, I'm getting confused. Isn't there two levels of parallelism going on?

Parallel Snakemake rules
Parallel Nextclade processes within a Snakemake rule

As you stated, [1] is limited by threads. However I thought [2] relies on Snakemake reserving the number of cores defined by threads for the single rule? Or am I completely misunderstanding how Snakemake works under the hood?

joverlee521 commented 4 months ago

Or am I completely misunderstanding how Snakemake works under the hood?

Chatted with @tsibley about this in our 1:1 today and I am misunderstanding. Snakemake's threads are just "guidelines" where we use it, so it does need to be passed to nextclade run -j {threads} to be respected by Nextclade. Just defining threads is not enough as Nextclade would still oversubscribe.

I am still curious whether limiting the threads will be faster than letting the two nextclade jobs compete for resources, so I'll update the workflow to pass in threads to Nextclade tomorrow and see how that affects the runtime.

joverlee521 commented 3 months ago

I am still curious whether limiting the threads will be faster than letting the two nextclade jobs compete for resources, so I'll update the workflow to pass in threads to Nextclade tomorrow and see how that affects the runtime.

I took a quick look at run times for this week after adding threads. The runtime when using the Nextclade cache is fast enough that there's no noticeable difference.

However, there is a difference in a full Nextclade run. Comparing just the run_wuhan_nextclade job runtimes:

2024-07-08's full run without threads (snakemake_stats.json): 8hrs 54mins (32098s)
2024-07-17's full run with threads (snakemake_stats.json): 5hrs 16mins (18997s)

joverlee521 commented 3 months ago

With plans to ignore cache with new versions, I think we will do full runs more often and so it makes sense to keep the defined threads.

nextstrain / ncov-ingest

Define threads for Nextclade rules #459