Closed by joverlee521 4 months ago
Good idea but I don't think we need to set these threads. All they do, in my experience, is limit parallelism.
Have you checked that there's a need for this? I'm pretty sure things have always been running in parallel just fine.
Nextclade uses all cores by default, unless you tell it otherwise through -j. It doesn't read env variables AFAIK.
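For example, something like this caps Nextclade at 4 of its own worker threads (the dataset, input file, and output directory here are just illustrative):

# Without -j, Nextclade spawns as many worker threads as there are cores.
nextclade run -j 4 -d nextstrain/mpox input.fasta --output-all nextclade_output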
> Have you checked that there's a need for this? I'm pretty sure things have always been running in parallel just fine.
Hmm, I'm checking with a small example below. Note that rule a does not define threads and so defaults to 1 core.
rule all:
    input:
        b = "b.txt",
        c = "c.txt"

rule a:
    output: touch("a.txt")
    shell:
        """
        echo rule a nproc "$(nproc)"
        """

rule b:
    input: "a.txt"
    output: touch("b.txt")
    threads: workflow.cores * 0.5
    shell:
        """
        echo rule b nproc "$(nproc)"
        """

rule c:
    input: "a.txt"
    output: touch("c.txt")
    threads: 2
    shell:
        """
        echo rule c nproc "$(nproc)"
        """
Running with nextstrain build --cpus 8
$ nextstrain build --cpus 8 snakemake/threads/ --forceall --quiet
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job count
----- -------
a 1
all 1
b 1
c 1
total 4
Select jobs to execute...
echo rule a nproc "$(nproc)"
rule a nproc 1
Touching output file a.txt.
Select jobs to execute...
echo rule b nproc "$(nproc)"
echo rule c nproc "$(nproc)"
rule b nproc 4
Touching output file b.txt.
rule c nproc 2
Touching output file c.txt.
Select jobs to execute...
Complete log: .snakemake/log/2024-07-09T175422.806795.snakemake.log
I still don't think this is necessary. I've never used nproc and it seems to report something that nextclade doesn't care about - at least on macOS.
See how these three rules all run in parallel, even though one doesn't mention threads at all. They all run at the same time, using as many jobs as there are cores (unless limited by -j). So the threads variable has no effect other than limiting parallelism.
Though it does indeed seem to set some env variables that make nproc report a reduced number of cores. But this is not done through cgroups, so it probably has no effect on programs that don't look at those env variables.
Snakefile:
rule all:
    input:
        a="a.txt",
        c="c.txt",
        d="d.txt",

rule a:
    input:
        "input.fasta",
    output:
        touch("a.txt"),
    shell:
        """
        echo rule a nproc "$(nproc)"
        time nextclade run -d nextstrain/mpox {input} --output-all nextclade_a
        """

rule c:
    input:
        "input.fasta",
    output:
        touch("c.txt"),
    threads: 4
    shell:
        """
        echo rule c nproc "$(nproc)"
        time nextclade run -d nextstrain/mpox {input} --output-all nextclade_c
        """

rule d:
    input:
        "input.fasta",
    output:
        touch("d.txt"),
    threads: 4
    shell:
        """
        echo rule d nproc "$(nproc)"
        time nextclade run -j {threads} -d nextstrain/mpox {input} --output-all nextclade_d
        """
Output:
$ snakemake --force
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /opt/homebrew/bin/bash
Provided cores: 10
Rules claiming more threads will be scaled down.
Job stats:
job count
----- -------
a 1
all 1
c 1
d 1
total 4
Select jobs to execute...
Execute 3 jobs...
[Tue Jul 9 20:22:54 2024]
localrule a:
input: input.fasta
output: a.txt
jobid: 1
reason: Forced execution
resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T
[Tue Jul 9 20:22:54 2024]
localrule d:
input: input.fasta
output: d.txt
jobid: 3
reason: Forced execution
threads: 4
resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T
[Tue Jul 9 20:22:54 2024]
rule a nproc 1
localrule c:
input: input.fasta
output: c.txt
jobid: 2
reason: Forced execution
threads: 4
resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T
rule d nproc 4
rule c nproc 4
real 0m3.644s
user 0m9.996s
sys 0m0.405s
Touching output file a.txt.
[Tue Jul 9 20:22:58 2024]
Finished job 1.
1 of 4 steps (25%) done
real 0m3.654s
user 0m9.974s
sys 0m0.423s
Touching output file c.txt.
[Tue Jul 9 20:22:58 2024]
Finished job 2.
2 of 4 steps (50%) done
real 0m4.888s
user 0m9.459s
sys 0m0.376s
Touching output file d.txt.
[Tue Jul 9 20:22:59 2024]
Finished job 3.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Jul 9 20:22:59 2024]
localrule all:
input: a.txt, c.txt, d.txt
jobid: 0
reason: Forced execution
resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T
[Tue Jul 9 20:22:59 2024]
Finished job 0.
4 of 4 steps (100%) done
Complete log: .snakemake/log/2024-07-09T202254.489461.snakemake.log
Note that the slowest is the one that gets 4 threads but limits nextclade jobs via -j 4. This is slower than not defining threads and not passing jobs to nextclade. So the addition of jobs doesn't make anything faster, it just adds lines of code.
I output env for each rule, and these are the differences:
$ diff env_a.txt env_d.txt
10c10
< NUMEXPR_NUM_THREADS=1
---
> NUMEXPR_NUM_THREADS=4
28c28
< OPENBLAS_NUM_THREADS=1
---
> OPENBLAS_NUM_THREADS=4
40,41c40,41
< VECLIB_MAXIMUM_THREADS=1
< GOTO_NUM_THREADS=1
---
> VECLIB_MAXIMUM_THREADS=4
> GOTO_NUM_THREADS=4
63c63
< OMP_NUM_THREADS=1
---
> OMP_NUM_THREADS=4
69c69
< MKL_NUM_THREADS=1
---
> MKL_NUM_THREADS=4
So it looks like nproc reports what one of these variables says, but these variables are ignored by Nextclade, and hence threads only has the effect of limiting parallelism (since the default is 1 if omitted).
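A quick way to see this from a shell: GNU coreutils' nproc honors OMP_NUM_THREADS, one of the variables Snakemake exports per job (the core counts below are illustrative for a 10-core machine):

$ nproc                     # reports all cores visible to the process
10
$ OMP_NUM_THREADS=4 nproc   # nproc respects the variable; Nextclade does not
4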
Hmm, I'm getting confused. Aren't there two levels of parallelism going on: [1] how many Snakemake jobs run at once and [2] how many cores a single job uses? As you stated, [1] is limited by threads. However, I thought [2] relies on Snakemake reserving the number of cores defined by threads for the single rule? Or am I completely misunderstanding how Snakemake works under the hood?
> Or am I completely misunderstanding how Snakemake works under the hood?
Chatted with @tsibley about this in our 1:1 today and I am misunderstanding. Snakemake's threads are just "guidelines" the way we use it, so the value does need to be passed to nextclade run -j {threads} to be respected by Nextclade. Just defining threads is not enough, as Nextclade would still oversubscribe.
I am still curious whether limiting the threads will be faster than letting the two nextclade jobs compete for resources, so I'll update the workflow to pass in threads to Nextclade tomorrow and see how that affects the runtime.
> I am still curious whether limiting the threads will be faster than letting the two nextclade jobs compete for resources, so I'll update the workflow to pass in threads to Nextclade tomorrow and see how that affects the runtime.
I took a quick look at run times for this week after adding threads. The runtime when using the Nextclade cache is fast enough that there's no noticeable difference.
However, there is a difference in a full Nextclade run. Comparing just the run_wuhan_nextclade job runtimes:
With plans to ignore the cache with new versions, I think we will do full runs more often, so it makes sense to keep the defined threads.
Prompted by https://github.com/nextstrain/ncov-ingest/issues/456#issuecomment-2218083580
From my reading of the Snakemake docs on threads, I think we need to define threads for the Nextclade Snakemake rules in order for Nextclade to parallelize.
Proposing that we split the workflow cores between the two Nextclade rules:
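A minimal sketch of what that could look like, modeled on the test Snakefiles above; the second rule's name, the file paths, flag files, and dataset names are placeholders rather than the actual ncov-ingest definitions:

rule run_wuhan_nextclade:
    input:
        sequences="data/sequences.fasta",      # placeholder path
    output:
        touch("data/nextclade_wuhan.done"),    # placeholder flag file
    params:
        dataset="nextstrain/sars-cov-2",       # placeholder dataset name
    threads: workflow.cores * 0.5              # half of the cores given via --cores / --cpus
    shell:
        """
        nextclade run -j {threads} -d {params.dataset} {input.sequences} --output-all data/nextclade_wuhan
        """

# The second rule's name and dataset below are illustrative only.
rule run_21L_nextclade:
    input:
        sequences="data/sequences.fasta",      # placeholder path
    output:
        touch("data/nextclade_21L.done"),      # placeholder flag file
    params:
        dataset="nextstrain/sars-cov-2/21L",   # placeholder dataset name
    threads: workflow.cores * 0.5
    shell:
        """
        nextclade run -j {threads} -d {params.dataset} {input.sequences} --output-all data/nextclade_21L
        """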