karel-brinda / Phylign

Alignment against all pre-2019 bacteria on laptops within a few hours (former MOF-Search)
http://brinda.eu/mof

Number of download threads reportedly being 1 on some machines regardless of the configuration #242

Closed by karel-brinda 9 months ago

karel-brinda commented 9 months ago

According to one of the reviewers, the number of download threads they could use was always 1, which made the download extremely slow for them.

The only explanation I can think of is that they used a virtual machine with only 1 core or a Slurm job with 1 CPU. In such a case, I guess Snakemake doesn't go above the number of available cores.

Is there any way to fix this? The number of download threads used should be independent of the number of assigned cores.
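
To illustrate the suspected mechanism (a sketch; the rule target is shortened here and is not the exact Phylign target):

snakemake download --cores all    # concurrency capped at the number of cores Snakemake detects
snakemake download --cores 8      # up to 8 single-threaded jobs run in parallel, even on a smaller machine

With --cores all, Snakemake resolves "all" to the cores available to the process, so a 1-core VM or a 1-CPU Slurm allocation would cap concurrent download jobs at 1.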

What do you think @leoisl ?

karel-brinda commented 9 months ago

@leoisl Are you able to reproduce this somehow?

leoisl commented 9 months ago

I am not sure; all I can think of is a misconfiguration when setting up the Slurm profile. I can test on the EBI Slurm cluster; I will do it today and report back.

karel-brinda commented 9 months ago

This is the relevant comment:

  1. Needed to make download again (and wait many hours for things to download); not clear how to use more than one thread, given that the config.yaml file’s max_download_threads doesn’t affect the number of threads used by make download.

leoisl commented 9 months ago

Yeah, I don't fully understand it either, but I will try to debug. First thing: can we change the submission to the Slurm cluster? Currently we have this: https://github.com/karel-brinda/mof-search/blob/87237f4e2ababc96840066db03a39ec28839452f/Makefile#L98-L104

It has a fixed number of cores (10), which has to be manually synchronised with https://github.com/karel-brinda/mof-search/blob/87237f4e2ababc96840066db03a39ec28839452f/config.yaml#L51

Also, the submission will fail if a priority partition does not exist, and the pipeline will fail if it takes longer than 8 hours.

Would it be OK for the Slurm run to work like the LSF run, i.e., we tell Snakemake that the executor is Slurm and let it manage the jobs?
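
A rough sketch of what that could look like with the generic cluster interface of pre-v8 Snakemake (the time limit and job cap are illustrative, not values taken from the repository):

snakemake --cluster "sbatch -c {threads} -t 08:00:00" --jobs 100 --use-conda --keep-going

Snakemake would then submit one Slurm job per rule instance and request per-job resources itself, instead of running the whole workflow inside a single fixed-size sbatch allocation.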

karel-brinda commented 9 months ago

Slurm is currently not the issue; we can fix the Slurm submission system later. It's just one example where I can imagine this phenomenon might theoretically be encountered (and not with this submission script; someone would have to submit it with only 1 core).

leoisl commented 9 months ago

If the user ran make cluster_slurm and did not change config.yaml, we would actually see the opposite effect: mof-search would use more cores than the 10 given to the Slurm job, because the default in config.yaml is threads: all, so Snakemake would not limit itself to 10 cores but would use all cores on the worker node, which is a bug in make cluster_slurm. I can't actually see how we could be limited to a single download thread unless threads or max_download_threads is set to 1 in config.yaml.
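
For reference, these are the two config.yaml keys in question (a sketch; threads: all is the default mentioned above, and the numeric value mirrors the command quoted further down in this thread rather than a verified default):

threads: all                 # cores Snakemake may use; "all" resolves to every core on the worker node
max_download_threads: 80     # global resource capping the number of simultaneous download jobs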

I think we don't have enough information to properly debug this; we would need the command that was run and the full config.yaml...

karel-brinda commented 9 months ago

What happens if you run snakemake -j8 on a computer with only 1 CPU? Will Snakemake use 8 threads or just 1?

leoisl commented 9 months ago

8

This is my test:

Snakefile:

rule all:
    input:
        [f"{i}.txt" for i in range(1000)]

rule touch:
    output: "{i}.txt"
    shell: "sleep 100000; touch {output}"

Command (this is run on my local laptop, which has 8 cores, but I tell Snakemake to use 1000 cores):

snakemake -j 1000

1000 touch jobs are running simultaneously:

$ ps aux | grep sleep | grep touch | wc -l
1000
karel-brinda commented 9 months ago

OK; thanks for the test! I will reply that we're unable to identify the cause of this issue.

karel-brinda commented 9 months ago

Closing this for now. I will reopen it in the future if we manage to reproduce it.

karel-brinda commented 9 months ago

@leoisl I've just actually observed exactly the same issue on GenOuest.

Running make download_asms invokes the following Snakemake command:

snakemake download_asms_batches --cores all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=80 max_io_heavy_threads=8 max_ram_mb=12288 -j 99999

If the number of allocated Slurm cores is 2, then even with 80 pre-specified download threads, --cores all causes downscaling to only 2 concurrent download jobs.
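
One possible direction for a fix (a sketch only, not the final change): drop --cores all, which resolves to the Slurm allocation, and pass an explicit high core count instead, so that downloads are throttled by the max_download_threads resource rather than by the detected core count:

snakemake download_asms_batches --cores 99999 --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=80 max_io_heavy_threads=8 max_ram_mb=12288

This mirrors the laptop test above, where an explicit -j 1000 ran far more jobs than the 8 physical cores.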

karel-brinda commented 9 months ago

@leoisl Could you please look into how to fix this? I think it shouldn't be hard.