onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License
92 stars · 8 forks

Attempting Excessive Memory Allocation #31

Closed dcdanko closed 5 years ago

dcdanko commented 5 years ago

Finch is trying to allocate over 100GB of memory, how can I limit this to a more reasonable amount?

The fastq file in question is fairly small, ~2GB

memory allocation of 103079215104 bytes failed
/bin/bash: line 1:  7007 Aborted                 /home/dcd3001/.cargo/bin/finch sketch --no-strict --seed 42 --n-hashes 10000000 --binary-format -o SL280215/SL280215.finch_sketch.sketch.msh SL280215/SL280215.filter_human_dna.nonhuman_read1.fastq.gz
boydgreenfield commented 5 years ago

@dcdanko The large memory blow-up is because you have 10M hashes (which is probably on the order of ~500 MB, since we store the k-mer, the minmer for each hash, and a number of additional per-minmer bookkeeping fields) and finch does "over-sketching" of FASTQ files in order to eliminate spurious error k-mers (see the README for more details).
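To make the over-sketching idea concrete, here's a minimal Python sketch of the general "keep a larger candidate set, drop singleton k-mers, then truncate" technique. This is an illustration only, not finch's actual implementation: the hash function, parameters, and `min_abundance` cutoff here are arbitrary stand-ins.

```python
import hashlib
from collections import Counter

def kmers(seq, k):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def h(kmer):
    # Stand-in 64-bit hash; finch uses a different (much faster) hash internally.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def oversketch_minhash(reads, k=4, n=3, oversketch=10, min_abundance=2):
    """Count k-mer abundances, keep the n*oversketch smallest-hashing k-mers
    as candidates, drop likely sequencing errors (seen < min_abundance times),
    then return the n smallest surviving hashes."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            counts[km] += 1
    candidates = sorted(counts, key=h)[: n * oversketch]
    survivors = [km for km in candidates if counts[km] >= min_abundance]
    return sorted(h(km) for km in survivors)[:n]

# The third read carries a one-off "error"; its unique k-mers appear once
# and are filtered out, so the sketch matches the clean reads' sketch.
reads = ["ACGTACGTACGT", "ACGTACGTACGT", "ACGTAGGTACGT"]
print(oversketch_minhash(reads))
```

The point of the larger candidate set is that an error k-mer with a small hash value can't displace a real k-mer from the final sketch: it gets counted, recognized as a singleton, and discarded before truncation to `n`.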

By default this is a 200-fold change, so that takes you from ~500 MB to ~100 GB. You can pass --oversketch 1 as a parameter to turn this behavior off (i.e., use a 1:1 ratio of "oversketch" to sketch size) and/or use fewer hashes.
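The failed allocation is consistent with this explanation. Working backwards from the numbers in the traceback (the ~51.5 bytes/slot figure is an inference from this arithmetic, not finch's documented memory layout):

```python
# Failed allocation from the traceback above.
failed_alloc = 103_079_215_104    # bytes; exactly 96 GiB
n_hashes = 10_000_000             # --n-hashes 10000000
oversketch = 200                  # finch's default over-sketch factor for FASTQ

slots = n_hashes * oversketch     # 2 billion candidate slots
bytes_per_slot = failed_alloc / slots
print(bytes_per_slot)             # ~51.5 bytes per slot (inferred, not documented)

# Memory scales linearly in both knobs, so --oversketch 1 would need roughly:
print(n_hashes * 1 * bytes_per_slot / 2**30)  # ~0.48 GiB
```

Either lowering --oversketch or lowering --n-hashes shrinks the allocation proportionally, which is why both suggestions below work.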

dcdanko commented 5 years ago

Oversketching seems useful. Will I still get some useful effects with a smaller --oversketch value?

For reference this is related to this PR https://github.com/MetaSUB/MetaSUB_CAP/pull/25 by @bovee

boydgreenfield commented 5 years ago

@dcdanko I'd actually recommend using a smaller number of hashes (no more than 100k?) and then you can also play around with the --oversketch parameter and set it down to 100 or 50 or similar without a problem.

If you're trying for exact Mash compatibility, you should keep the --seed 42 param and then also use --no-filter, but I'd just lean towards dropping the --seed parameter and trying something more like finch sketch $FILE --n-hashes 10000 as a good nearly "out-of-the-box" setting (default n=1000).

dcdanko commented 5 years ago

No, we definitely want to leave n as is.

We've found that Mash sketches don't perform particularly well for metagenomes below that threshold.

boydgreenfield commented 5 years ago

@dcdanko I'd guess part of the reason you needed those to be so large is because they had tons of error k-mers in them (though I know Mash can filter singletons out). I'd experiment with 100k or 1M then and perhaps setting the --min-abun-filter and --oversketch parameters manually.

dcdanko commented 5 years ago

I'm not so sure about that: low-abundance strains are virtually indistinguishable from sequencing errors in our setting.

Which parameters should I set to have finch behave exactly like (very fast) vanilla Mash? Will --oversketch 0 do it?

boydgreenfield commented 5 years ago

@dcdanko Using --no-filter --seed 42 will have it behave exactly like Mash, assuming you're not using Mash's "filter out singletons" flag (I don't think that's the default?). There isn't an exact command to replicate that flag.
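Putting the pieces of this thread together, a Mash-compatible invocation would look something like the following (only flags already mentioned in this thread are used; the input and output paths are placeholders):

```shell
# Mash-compatible sketching per this thread: --no-filter disables finch's
# error filtering (and over-sketching), --seed 42 matches Mash's hash seed.
CMD="finch sketch --no-filter --seed 42 -o sample.msh sample.fastq.gz"
echo "$CMD"
```

Because --no-filter skips the abundance filtering step, this also avoids the large over-sketch allocation that started this issue.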

dcdanko commented 5 years ago

Thanks!

jianshu93 commented 2 years ago

Hello All,

I never succeed, even with only 1000 genomes:

finch sketch -k 28 --oversketch 1 -n 12000 -o ../GWMC.finch *.fasta

I have 100 GB memory available.

Any idea why? And when it comes to the entire GTDB database (47,893 genomes), how should I handle this larger collection?

Thanks,

Jianshu