sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Resource requirements for `sketchall` plugin #2748

Open ccbaumler opened 10 months ago

ccbaumler commented 10 months ago

While the memory and CPU resources required by the new sketchall plugin for sourmash may be self-evident, the time it requires to run is less so.

When using this multi-threaded approach on large files or to build large databases, allow for a long processing time (think days instead of hours). I will report back with some benchmarks. In the meantime, here is what I have found:

I was using 86 metagenomes from Petabyte Scale Sequence Search: Metagenomics Benchmarking Codeathon. The smallest of these is 3.3G and the largest is 41G.

When running in a workflow, the job would end before the output was complete (see code block for the resource report). I have reset the workflow job to run for three days.

    threads: 64
    resources: mem_mb=40960, mem_mib=39063, disk_mb=2891102, disk_mib=2757170, tmpdir=/tmp, time=600, partition=high2, nodes=1, runtime=600, allowed_jobs=100

        sourmash scripts sketchall fastq/ -p scaled=1000,k=31,k=51,scaled=1000,abund -j 64 -o manysigs/

Activating conda environment: ../../miniconda3/bd637570928d6f69182785adf8d9fd98_

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

finding all input files under 'fastq/' with pattern '*'
Starting to sketch 86 files with 64 threads.
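
A note on reading the resource report: each `*_mb` value is paired with a `*_mib` value, which is the same quantity expressed in binary mebibytes instead of decimal megabytes. The sketch below reproduces the pairs reported in this thread. The function name is hypothetical, and the round-up rule is inferred from the numbers shown here, not taken from the Snakemake source.

```python
def mb_to_mib(megabytes: int) -> int:
    """Convert decimal megabytes (10**6 bytes) to mebibytes (2**20 bytes).

    Rounds up, which is an assumption inferred from the values in the
    resource reports in this thread.
    """
    total_bytes = megabytes * 10**6
    # Ceiling division using integers only, to avoid float rounding.
    return -(-total_bytes // 2**20)


# Pairs taken from the resource reports in this thread.
for mb in (40960, 8192, 51200, 2891102):
    print(f"{mb} MB = {mb_to_mib(mb)} MiB")
```

So `mem_mb=40960` is a request for roughly 39 GiB, and `mem_mb=8192` for under 8 GiB.
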
ccbaumler commented 10 months ago

Ran into an out-of-memory error while running. I will try this again with 50G instead of 8G.

Resources:

threads: 64
resources: mem_mb=8192, mem_mib=7813, disk_mb=2891102, disk_mib=2757170, tmpdir=<TBD>, time=4320, partition=high2, nodes=1, runtime=4320

Error:

Some of your processes may have been killed by the cgroup out-of-memory handler.
ccbaumler commented 10 months ago

Interesting development with the sketchall plugin: upon completion of the set of signatures, the sketchall command starts over and places a second sketch in the signature file.

    input:
        fastq = expand("fastq/{run_id}.fastq", run_id=run_ids),
    output:
        all_sigs = expand("manysigs/{run_id}.fastq.zip", run_id=run_ids),
    params:
        manysigs = "manysigs/",
        fastq = "fastq/",
        k_list = lambda wildcards: ",".join([f"k={k}" for k in config["k-size"]]),
        scale = config.get("scaled-value")
    log:
        "logs/sourmash_script.log"
    conda:
        "envs/download-sketch-env.yaml"
    shell:
        """
        sourmash scripts sketchall {params.fastq} -p scaled={params.scale},{params.k_list},abund -j {threads} -o {params.manysigs}
        """

sourmash sig fileinfo manysigs/SRR8849274.fastq.zip

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'manysigs/SRR8849274.fastq.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/download-seq/download-seq/manysigs/SRR8849274.fastq.zip
is database? yes
has manifest? yes
num signatures: 6
** examining manifest...
total hashes: 5508866
summary of sketches:
   2 sketches with DNA, k=21, scaled=1000, abund      1437902 total hashes
   2 sketches with DNA, k=31, scaled=1000, abund      1764610 total hashes
   2 sketches with DNA, k=51, scaled=1000, abund      2306354 total hashes
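
To check many zip files for this kind of repetition without reading each report by eye, the summary lines can be parsed. This is a minimal sketch that works on the text printed by `sourmash sig fileinfo` as it appears in this thread. It does not use the sourmash Python API, the function and pattern names are hypothetical, and it assumes one sketch per parameter set is the expected count.

```python
import re

# Matches lines such as:
#    2 sketches with DNA, k=21, scaled=1000, abund      1437902 total hashes
SUMMARY_LINE = re.compile(
    r"^\s*(\d+) sketches with (.+?)\s{2,}(\d+) total hashes\s*$"
)


def sketch_counts(fileinfo_output: str) -> dict[str, int]:
    """Map each sketch description to the number of sketches reported."""
    counts = {}
    for line in fileinfo_output.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def repeated_sketches(fileinfo_output: str) -> dict[str, int]:
    """Keep only the descriptions that appear more than once."""
    return {k: n for k, n in sketch_counts(fileinfo_output).items() if n > 1}


report = """summary of sketches:
   2 sketches with DNA, k=21, scaled=1000, abund      1437902 total hashes
   2 sketches with DNA, k=31, scaled=1000, abund      1764610 total hashes
   2 sketches with DNA, k=51, scaled=1000, abund      2306354 total hashes
"""
print(repeated_sketches(report))
```

A count above one only indicates that a parameter set occurs more than once in the manifest; it does not by itself prove the sketches are identical.
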
ctb commented 10 months ago

could you report what command is actually run by snakemake? e.g. output of snakemake -p? thanks!

ctb commented 10 months ago

(you can do snakemake -p -n ... if you don't want to re-run the actual command)

ccbaumler commented 10 months ago

Sure, I actually haven't touched the tmux pane this workflow died in...

    threads: 64
    resources: mem_mb=51200, mem_mib=48829, disk_mb=<TBD>, tmpdir=<TBD>, time=4320, partition=high2, nodes=1, attempt_cnt=6, runtime=4320

        sourmash scripts sketchall fastq/ -p scaled=1000,k=21,k=31,k=51,abund -j 64 -o manysigs/
ctb commented 10 months ago

I went to take a look on farm and it seems like you deleted the output already?

I am curious about the contents of the fastq/ directory and also what sourmash sig describe reports. In particular, I wanted to know if the pairs of sketches were in fact duplicates (same md5sum).

If you get to this point again, let me know pls! Alternatively, if you can reproduce this with smaller files, that'd be great :).

ccbaumler commented 10 months ago

Sorry, had to clear out disk space for other work. I'll generate some new signatures to inspect and let you know.

ccbaumler commented 10 months ago

When the workflow was allowed to run for ~1 hr 30 min:

sourmash sig fileinfo manysigs/SRR13122219.fastq.zip

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'manysigs/SRR13122219.fastq.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
is database? yes
has manifest? yes
num signatures: 9
** examining manifest...
total hashes: 10312863
summary of sketches:
   3 sketches with DNA, k=21, scaled=1000, abund      3015327 total hashes
   3 sketches with DNA, k=31, scaled=1000, abund      3418467 total hashes
   3 sketches with DNA, k=51, scaled=1000, abund      3879069 total hashes
sourmash sig describe manysigs/SRR13122219.fastq.zip

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
signature: SRR13122219.7524428 7524428 length=302
source file: fastq/SRR13122219.fastq
md5: 7a313940b4a3cc133d92f630ab3033d4
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1005109
sum hashes: 1975306
signature license: CC0

---
signature filename: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
signature: SRR13122219.7524428 7524428 length=302
source file: fastq/SRR13122219.fastq
md5: 2dd101186ac9e2762a4f3b193021a8aa
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1139489
sum hashes: 1928606
signature license: CC0

---
signature filename: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
signature: SRR13122219.7524428 7524428 length=302
source file: fastq/SRR13122219.fastq
md5: 0be1de4b798342b24cc39f8ab750e822
k=51 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1293023
sum hashes: 1853767
signature license: CC0

loaded 3 signatures total, from 1 files
ctb commented 10 months ago

(sig files are not readable by anyone other than you - can you chmod a+r them please? also - what do you expect to have happen here? I do see some oddities, but wondering what you wanted to see :)

ccbaumler commented 10 months ago

I have changed the permissions.

For posterity, the files change permissions throughout the sketchall process:

baumlerc@farm:~/download-seq/download-seq$ ls -l manysigs/
total 196629
-rw-rw-r-- 1 baumlerc baumlerc 66234192 Sep 19 10:27 SRR12480103.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 26728996 Sep 19 10:17 SRR13122219.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 42068746 Sep 19 10:34 SRR8849208.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 34818534 Sep 19 10:32 SRR8849216.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 31371871 Sep 19 10:31 SRR8849289.fastq.zip
baumlerc@farm:~/download-seq/download-seq$ ls -l manysigs/
total 196629
-rw-rw-r-- 1 baumlerc baumlerc 66234192 Sep 19 10:27 SRR12480103.fastq.zip
-rw------- 1 baumlerc baumlerc 26729924 Sep 19 10:55 SRR13122219.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 42068746 Sep 19 10:34 SRR8849208.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 34818534 Sep 19 10:32 SRR8849216.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 31371871 Sep 19 10:31 SRR8849289.fastq.zip
baumlerc@farm:~/download-seq/download-seq$ ls -l manysigs/
total 196629
-rw------- 1 baumlerc baumlerc 66235128 Sep 19 11:09 SRR12480103.fastq.zip
-rw------- 1 baumlerc baumlerc 26730476 Sep 19 11:38 SRR13122219.fastq.zip
-rw------- 1 baumlerc baumlerc 42069671 Sep 19 11:19 SRR8849208.fastq.zip
-rw------- 1 baumlerc baumlerc 34819459 Sep 19 11:16 SRR8849216.fastq.zip
-rw------- 1 baumlerc baumlerc 31372796 Sep 19 11:16 SRR8849289.fastq.zip
ctb commented 10 months ago

Took another look at this - from what I recall of our in-person conversation, you're running into the following problems/questions:

The sketchall command you're running is sourmash scripts sketchall fastq/ -p scaled=1000,k=21,k=31,k=51,abund -j 64 -o manysigs/ and you're running this in ~baumlerc/download-seq/download-seq and the Snakefile is named download-sketch.

The Snakefile does a lot of things and takes a long time, so I'm not (yet) trying to run it directly, but I note that there is a repeat in the benchmark: section of the relevant rule, which means snakemake is going to run it multiple times. This should have the effect that snakemake repeats the rule 5 times, right? In which case I'd expect to see multiple copies of the signatures in the zip files. Isn't this behavior expected?

I would advise against doing repetitive benchmarking of really big/large-file/long-running workflows as a routine thing, unless there's a very specific reason to do it. Re-downloading the files in particular seems like an odd thing to benchmark ;).

ctb commented 10 months ago

ok, in /home/ctbrown/2023-ccbaumler-debug, I have a script run.sh that runs the relevant command on a small directory of FASTA files.

I do observe that when I run it twice, I get the doubled output you see above:

sourmash sig summarize manysigs/0.fa.zip

shows 2 copies of each sketch, while

sourmash sig describe manysigs/0.fa.zip

shows only one. That's clearly a bug 🤔 .

I'm not sure how to think about the sketchall problem that if you run it multiple times, you'll get doubled, tripled, etc. sketches. It is both logical (it's sort of what I'd expect to happen!) and annoying. So maybe I should add a warning message to sketchall.
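
Until something like that exists, a workflow can guard against re-running over existing output itself. The following is a rough sketch of such a pre-flight check, not anything sketchall does today. The function name is hypothetical, and it assumes the `<input file name>.zip` output naming seen earlier in this thread (e.g. `fastq/SRR13122219.fastq` producing `manysigs/SRR13122219.fastq.zip`).

```python
import tempfile
from pathlib import Path


def existing_outputs(input_dir: Path, output_dir: Path) -> list[Path]:
    """Return the output zips that already exist for files in input_dir.

    Assumes the '<input file name>.zip' naming seen in this thread.
    """
    candidates = (
        output_dir / f"{p.name}.zip"
        for p in sorted(input_dir.iterdir())
        if p.is_file()
    )
    return [c for c in candidates if c.exists()]


# Small demonstration with throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    fastq, sigs = Path(tmp, "fastq"), Path(tmp, "manysigs")
    fastq.mkdir()
    sigs.mkdir()
    (fastq / "a.fastq").touch()
    (fastq / "b.fastq").touch()
    (sigs / "a.fastq.zip").touch()  # pretend a previous run wrote this
    leftovers = [p.name for p in existing_outputs(fastq, sigs)]
    print(leftovers)
```

If the returned list is non-empty, the caller could warn, stop, or remove the stale files before sketching again.
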

Let me know about the repeat behavior and if that's unexpected!

ctb commented 10 months ago

Oh and no idea what's going on with the file permissions issue, but that's probably a snakemake thing. By default I would expect your umask settings to be specifying output permissions unless snakemake is changing it; what does umask report?
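
For reference, the two permission patterns in the listings above are what a plain file creation produces under a umask of 002 (`-rw-rw-r--`) and 077 (`-rw-------`). A small sketch that shows the effect on a POSIX system follows; the helper name and the file name are made up for the illustration, and nothing here establishes which umask is actually in effect inside the workflow.

```python
import os
import stat
import tempfile


def mode_of_new_file(umask: int) -> str:
    """Create a file under the given umask and return its mode string."""
    previous = os.umask(umask)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "example.zip")
            # open() requests mode 0o666; the umask clears bits from it.
            with open(path, "w"):
                pass
            return stat.filemode(os.stat(path).st_mode)
    finally:
        # Always restore the umask of the calling process.
        os.umask(previous)


print(mode_of_new_file(0o002))
print(mode_of_new_file(0o077))
```

Separately, Python's `tempfile.mkstemp` always creates files with mode 0600 regardless of the umask, so any tool that rewrites an archive through such a temporary file would also end up with `-rw-------`. Whether that is what happens here has not been established in this thread.
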

ctb commented 10 months ago

describe/summarize oddity punted to https://github.com/sourmash-bio/sourmash/issues/2774