Closed: @dcdanko closed this issue 5 years ago.
@dcdanko The large memory blow-up is because you have 10M hashes (which is probably on the order of ~500MB, since we store the k-mer, the minmer for each hash, and then a number of additional per-minmer bookkeeping fields) and finch does "over-sketching" of FASTQ files in order to eliminate spurious error k-mers (see the README for more details). By default this is a 200-fold expansion, so that takes you from ~500MB to ~100GB. You can pass `--oversketch 1` as a parameter to turn this behavior off (i.e. use a 1:1 ratio of "oversketch" to sketch size) and/or use fewer hashes.
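The arithmetic above can be sketched as a quick back-of-envelope calculation. The ~50 bytes/hash figure is an assumption inferred from "10M hashes ≈ 500MB" in this thread, not a number from finch's source:

```python
# Rough estimate of peak sketch memory during finch's over-sketching.
# BYTES_PER_HASH is an assumption (k-mer + minmer + bookkeeping per hash).
BYTES_PER_HASH = 50

def sketch_memory_gb(n_hashes: int, oversketch: int) -> float:
    """Approximate peak memory in GB for a sketch with an oversketch factor."""
    return n_hashes * oversketch * BYTES_PER_HASH / 1e9

print(sketch_memory_gb(10_000_000, 200))  # default 200x oversketch -> 100.0 GB
print(sketch_memory_gb(10_000_000, 1))    # with --oversketch 1 -> 0.5 GB
```

This reproduces the ~500MB to ~100GB jump described above and shows why either fewer hashes or a smaller oversketch factor brings memory back down.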
Oversketching seems useful. Will I still get some useful effect with a smaller `--oversketch` value?
For reference, this is related to this PR https://github.com/MetaSUB/MetaSUB_CAP/pull/25 by @bovee
@dcdanko I'd actually recommend using a smaller number of hashes (no more than 100k?) and then you can also play around with the `--oversketch` parameter and set it down to 100 or 50 or similar without a problem.
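A concrete invocation along these lines (`reads.fastq` is a placeholder filename; the flags are the ones discussed in this thread):

```shell
# Hypothetical example: 100k hashes with a reduced 50x oversketch,
# instead of the default 200x that blew past 100GB above.
finch sketch reads.fastq --n-hashes 100000 --oversketch 50
```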
If you're trying for exact Mash compatibility, you should keep the `--seed 42` param and then also use `--no-filter`, but I'd lean towards dropping the `--seed` parameter and trying something more like `finch sketch $FILE --n-hashes 10000` as a good, nearly "out-of-the-box" setting (the default is n=1000).
No, we definitely want to leave n as is. We've found that Mash sketches don't perform particularly well for metagenomes below that threshold.
@dcdanko I'd guess part of the reason you needed those to be so large is that they had tons of error k-mers in them (though I know Mash can filter singletons out). I'd experiment with 100k or 1M then, and perhaps set the `--min-abun-filter` and `--oversketch` parameters manually.
I'm not so sure about that: low-abundance strains are virtually indistinguishable from sequencing errors in our setting.
Which parameters should I set to have finch behave exactly like (very fast) vanilla Mash? Will `--oversketch 0` do it?
@dcdanko Using `--no-filter --seed 42` will have it behave exactly like Mash, assuming you're not using Mash's "filter out singletons" flag (I don't think that's the default?). There isn't an exact finch command to replicate that flag.
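Putting that answer together as one command (`reads.fastq` is a placeholder filename; only flags mentioned in this thread are used):

```shell
# Hypothetical Mash-compatible invocation: disable finch's abundance
# filtering and pin the seed Mash uses by default.
finch sketch reads.fastq --no-filter --seed 42
```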
Thanks!
Hello All,
I never succeed, even with only 1000 genomes:
`finch sketch -k 28 --oversketch 1 -n 12000 -o ../GWMC.finch *.fasta`
I have 100 GB of memory available.
Any idea why? And when it comes to the entire GTDB database (47,893 genomes), how should I sketch this larger collection?
Thanks,
Jianshu
Finch is trying to allocate over 100GB of memory; how can I limit this to a more reasonable amount? The FASTQ file in question is fairly small, ~2GB.