Open ccbaumler opened 10 months ago
Ran into an out-of-memory error while running. I will try this again with 50G instead of 8G.
Resources:
threads: 64
resources: mem_mb=8192, mem_mib=7813, disk_mb=2891102, disk_mib=2757170, tmpdir=<TBD>, time=4320, partition=high2, nodes=1, runtime=4320
Error:
Some of your processes may have been killed by the cgroup out-of-memory handler.
Interesting development with the sketchall plugin: upon completing the set of signatures, the sketchall command starts over and places a second sketch in the signature file.
input:
fastq = expand("fastq/{run_id}.fastq", run_id=run_ids),
output:
all_sigs = expand("manysigs/{run_id}.fastq.zip", run_id=run_ids),
params:
manysigs = "manysigs/",
fastq = "fastq/",
k_list = lambda wildcards: ",".join([f"k={k}" for k in config["k-size"]]),
scale = config.get("scaled-value")
log:
"logs/sourmash_script.log"
conda:
"envs/download-sketch-env.yaml"
shell:
"""
sourmash scripts sketchall {params.fastq} -p scaled={params.scale},{params.k_list},abund -j {threads} -o {params.manysigs}
sourmash sig fileinfo manysigs/SRR8849274.fastq.zip
== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from 'manysigs/SRR8849274.fastq.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/download-seq/download-seq/manysigs/SRR8849274.fastq.zip
is database? yes
has manifest? yes
num signatures: 6
** examining manifest...
total hashes: 5508866
summary of sketches:
2 sketches with DNA, k=21, scaled=1000, abund 1437902 total hashes
2 sketches with DNA, k=31, scaled=1000, abund 1764610 total hashes
2 sketches with DNA, k=51, scaled=1000, abund 2306354 total hashes
Could you report what command is actually run by snakemake, e.g. the output of snakemake -p? Thanks!
(You can do snakemake -p -n ... if you don't want to re-run the actual command.)
Sure, I actually haven't touched the tmux pane this workflow died in...
threads: 64
resources: mem_mb=51200, mem_mib=48829, disk_mb=<TBD>, tmpdir=<TBD>, time=4320, partition=high2, nodes=1, attempt_cnt=6, runtime=4320
sourmash scripts sketchall fastq/ -p scaled=1000,k=21,k=31,k=51,abund -j 64 -o manysigs/
I went to take a look on farm and it seems like you deleted the output already?
I am curious about the contents of the fastq/ directory and also what sourmash sig describe reports. In particular, I wanted to know if the pairs of sketches were in fact duplicates (same md5sum).
If you get to this point again, let me know pls! Alternatively, if you can reproduce this with smaller files, that'd be great :).
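One way to check for duplicates without re-running anything is to read the manifest stored inside the zip and count how often each md5 appears. The following is a minimal sketch, assuming the zip holds a manifest member named SOURMASH-MANIFEST.csv that may start with a # comment line and has md5 and ksize columns. The function name count_md5_copies is hypothetical and not part of sourmash.

```python
import csv
import io
import zipfile
from collections import Counter

# Assumed name of the manifest member inside the zip; adjust if yours differs.
MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def count_md5_copies(zip_path):
    """Return a Counter mapping (ksize, md5) -> number of manifest rows."""
    with zipfile.ZipFile(zip_path) as zf:
        text = zf.read(MANIFEST_NAME).decode("utf-8")
    # Drop comment lines (e.g. a leading '# ...' version header) before parsing.
    rows = [line for line in text.splitlines() if not line.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(rows)))
    return Counter((row["ksize"], row["md5"]) for row in reader)


def report(zip_path):
    """Print one line per distinct sketch, flagging any that appear more than once."""
    for (ksize, md5), n in sorted(count_md5_copies(zip_path).items()):
        flag = "DUPLICATE" if n > 1 else "unique"
        print(f"k={ksize} md5={md5} copies={n} {flag}")
```

Calling report("manysigs/SRR13122219.fastq.zip") would then show whether the repeated sketches share an md5 or differ.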
Sorry, had to clear out disk space for other work. I'll generate some new signatures to inspect and let you know.
When the workflow was allowed to run for ~1hr30min:
sourmash sig fileinfo manysigs/SRR13122219.fastq.zip
== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from 'manysigs/SRR13122219.fastq.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
is database? yes
has manifest? yes
num signatures: 9
** examining manifest...
total hashes: 10312863
summary of sketches:
3 sketches with DNA, k=21, scaled=1000, abund 3015327 total hashes
3 sketches with DNA, k=31, scaled=1000, abund 3418467 total hashes
3 sketches with DNA, k=51, scaled=1000, abund 3879069 total hashes
sourmash sig describe manysigs/SRR13122219.fastq.zip
== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
---
signature filename: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
signature: SRR13122219.7524428 7524428 length=302
source file: fastq/SRR13122219.fastq
md5: 7a313940b4a3cc133d92f630ab3033d4
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1005109
sum hashes: 1975306
signature license: CC0
---
signature filename: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
signature: SRR13122219.7524428 7524428 length=302
source file: fastq/SRR13122219.fastq
md5: 2dd101186ac9e2762a4f3b193021a8aa
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1139489
sum hashes: 1928606
signature license: CC0
---
signature filename: /home/baumlerc/download-seq/download-seq/manysigs/SRR13122219.fastq.zip
signature: SRR13122219.7524428 7524428 length=302
source file: fastq/SRR13122219.fastq
md5: 0be1de4b798342b24cc39f8ab750e822
k=51 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1293023
sum hashes: 1853767
signature license: CC0
loaded 3 signatures total, from 1 files
(The sig files are not readable by anyone other than you - can you chmod a+r them please?) Also - what do you expect to have happen here? I do see some oddities, but wondering what you wanted to see :)
I have changed the permissions.
For posterity, the files change permissions throughout the sketchall process:
baumlerc@farm:~/download-seq/download-seq$ ls -l manysigs/
total 196629
-rw-rw-r-- 1 baumlerc baumlerc 66234192 Sep 19 10:27 SRR12480103.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 26728996 Sep 19 10:17 SRR13122219.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 42068746 Sep 19 10:34 SRR8849208.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 34818534 Sep 19 10:32 SRR8849216.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 31371871 Sep 19 10:31 SRR8849289.fastq.zip
baumlerc@farm:~/download-seq/download-seq$ ls -l manysigs/
total 196629
-rw-rw-r-- 1 baumlerc baumlerc 66234192 Sep 19 10:27 SRR12480103.fastq.zip
-rw------- 1 baumlerc baumlerc 26729924 Sep 19 10:55 SRR13122219.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 42068746 Sep 19 10:34 SRR8849208.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 34818534 Sep 19 10:32 SRR8849216.fastq.zip
-rw-rw-r-- 1 baumlerc baumlerc 31371871 Sep 19 10:31 SRR8849289.fastq.zip
baumlerc@farm:~/download-seq/download-seq$ ls -l manysigs/
total 196629
-rw------- 1 baumlerc baumlerc 66235128 Sep 19 11:09 SRR12480103.fastq.zip
-rw------- 1 baumlerc baumlerc 26730476 Sep 19 11:38 SRR13122219.fastq.zip
-rw------- 1 baumlerc baumlerc 42069671 Sep 19 11:19 SRR8849208.fastq.zip
-rw------- 1 baumlerc baumlerc 34819459 Sep 19 11:16 SRR8849216.fastq.zip
-rw------- 1 baumlerc baumlerc 31372796 Sep 19 11:16 SRR8849289.fastq.zip
Took another look at this - from what I recall of our in-person conversation, you're running into the following problems/questions:
The sketchall command you're running is sourmash scripts sketchall fastq/ -p scaled=1000,k=21,k=31,k=51,abund -j 64 -o manysigs/ and you're running this in ~baumlerc/download-seq/download-seq and the Snakefile is named download-sketch.
The Snakefile does a lot of things and takes a long time, so I'm not (yet) trying to run it directly, but I note that there is a repeat in the benchmarks: section of the relevant rule that means it's going to run it multiple times. This should have the effect that snakemake repeats the rule 5 times, right? In which case I'd expect to see multiple copies of the signatures in the zip files. Isn't this behavior expected?
I would advise against doing repetitive benchmarking of really big/large-file/long-running workflows as a routine thing, unless there's a very specific reason to do it. Re-downloading the files in particular seems like an odd thing to benchmark ;).
OK, in /home/ctbrown/2023-ccbaumler-debug, I have a script run.sh that runs the relevant command on a small directory of FASTA files.
I do observe that when I run it twice, I get the doubled output you see above: sourmash sig summarize manysigs/0.fa.zip shows 2 copies of each sketch, while sourmash sig describe manysigs/0.fa.zip shows only one. That's clearly a bug 🤔.
I'm not sure how to think about the sketchall problem that if you run it multiple times, you'll get doubled, tripled, etc. sketches. It is both logical (it's sort of what I'd expect to happen!) and annoying. So maybe I should add a warning message to sketchall.
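Until such a warning exists, a workflow could check for leftover outputs before calling sketchall. Here is a minimal sketch, assuming outputs are named after the input file plus .zip inside the output directory (as with fastq/SRR13122219.fastq and manysigs/SRR13122219.fastq.zip above). find_existing_outputs is a hypothetical helper, not part of sourmash or the plugin.

```python
from pathlib import Path


def find_existing_outputs(input_dir, output_dir):
    """List output zips that already exist for the files in input_dir.

    Assumes the naming seen in this thread: <output_dir>/<input file name>.zip
    """
    out = Path(output_dir)
    existing = []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file():
            continue
        candidate = out / (src.name + ".zip")
        if candidate.exists():
            existing.append(candidate)
    return existing
```

If find_existing_outputs("fastq/", "manysigs/") returns a non-empty list, moving or deleting those files before the run should avoid the doubled sketches described here.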
Let me know about the repeat behavior and if that's unexpected!
Oh and no idea what's going on with the file permissions issue, but that's probably a snakemake thing. By default I would expect your umask settings to be specifying output permissions unless snakemake is changing it; what does umask report?
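For reference, a new regular file created with the default requested mode gets 0o666 & ~umask, so the two modes in the listings above are what masks of 002 and 077 would give. A small sketch of that arithmetic follows; the helper name is made up for illustration.

```python
import stat


def file_mode_for_umask(umask):
    """Mode a new regular file gets when created with the default 0o666."""
    return 0o666 & ~umask


for mask in (0o002, 0o077):
    mode = file_mode_for_umask(mask)
    # stat.filemode renders the mode the way `ls -l` does.
    print(f"umask {mask:03o} gives {stat.filemode(stat.S_IFREG | mode)}")
```

Note that a program can also request a stricter mode explicitly when it creates a file, so seeing -rw------- does not by itself prove that the umask changed.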
describe/summarize oddity punted to https://github.com/sourmash-bio/sourmash/issues/2774
While the memory and CPU resources required by the new sketchall plugin for sourmash may be self-evident, the time it requires to run is less so.
When using this multi-threaded approach on large files or to build large databases, allow for a long processing time (think days instead of hours). I will report back with some benchmarks. In the meantime, here is what I have found:
I was using 86 metagenomes from Petabyte Scale Sequence Search: Metagenomics Benchmarking Codeathon. The smallest of these is 3.3G and the largest is 41G.
When running in a workflow, the job would end before the output was complete (see code block for the resource report). I have reset the workflow job to run for three days.