Built the LCA DB for bacteria! Requires 18 GB of RAM to load, it looks like, and about 5 minutes to search for a single genome. But it's a small file at only 405MB! Yay?
```
% /usr/bin/time sourmash gather $Y outputs/lca/scaled/genbank-bacteria-k31-scaled10k.lca.json.gz

== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAADRZ010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

316.99user 35.32system 5:50.61elapsed 100%CPU (0avgtext+0avgdata 18766200maxresident)k
1227944inputs+24outputs (51major+31793920minor)pagefaults 0swaps
```
You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
```
% /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json

== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAEZMM010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

12.86user 3.81system 0:33.25elapsed 50%CPU (0avgtext+0avgdata 1064704maxresident)k
1204000inputs+583984outputs (68major+396507minor)pagefaults 0swaps
```
OTOH the SBT directory is 30GB uncompressed! So, um, ok.
> You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
>
> % /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json
> == This is sourmash version 2.0.1. ==
And you're using sourmash 2.0.1; it should be faster in newer versions.
> loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
That is a pretty small query for gather, but glad it was found quickly =]
> OTOH the SBT directory is 30GB uncompressed! So, um, ok.
Possible solutions:

- avoid the tempfile roundtrip when loading nodes (right now node data is read from `storage`, written to a tempfile, and then loaded into a Nodegraph from disk. This is a limitation in khmer, but the rust Nodegraph can be loaded from a memory buffer).

Another comment: for `search` we only need to load internal nodes once (they will never be checked again). This helps save total memory consumed, because we can unload an internal node after checking it. The `feature/unload` branch exposes this in the `find` function, but needs more tests.

This is not so useful for `gather`, because internal nodes might be checked more than once, but it might lead to a "low-memory" mode that always unloads internal node data, or a mixed approach where we cache the internal node data for frequently accessed nodes.

A dirty version of this is in the `unassigned.py` I wrote for @taylorreiter, but I would rather avoid having to dig into private fields like this and have a proper method =P
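Roughly what the unload idea looks like, as a minimal sketch; `load_data`, `unload_data`, and `matches` here are hypothetical stand-ins for the real node internals:

```python
from collections import deque

def search_unloading(root, query, threshold):
    """Breadth-first SBT search that frees each internal node's
    Bloom filter right after it is checked (fine for search, where
    no node is ever revisited)."""
    results = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node.load_data()                        # hypothetical: read Bloom filter into RAM
        match = node.matches(query, threshold)  # hypothetical containment check
        node.unload_data()                      # safe for search: never checked again
        if not match:
            continue
        if node.is_leaf():
            results.append(node)
        else:
            queue.extend(node.children)         # only descend into matching subtrees
    return results
```

For `gather` this same loop would thrash, since subtrees get revisited; the mixed approach would swap the unconditional `unload_data()` for an LRU cache over node data.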
compress internal nodes - is that Rust dependent?
> compress internal nodes - is that Rust dependent?
No, it should work now (it's a khmer feature). We need to either 1) change the indexing code to generate compressed nodes, or 2) load an SBT, compress its internal nodes, and update the .sbt.json file (make a new command to port old SBTs?)
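Option 2 could be a standalone porting script. A rough sketch, assuming the `.sbt.NAME/` storage layout from the listing below and that the loading code can read gzip-compressed node data:

```python
import gzip
import os
import shutil
import sys

def compress_sbt_nodes(storage_dir):
    """Gzip every node file in an SBT storage directory, in place.

    A sketch, not a supported sourmash command: the node filenames
    recorded in the .sbt.json would also need updating to point at
    the new .gz files."""
    for name in os.listdir(storage_dir):
        path = os.path.join(storage_dir, name)
        if name.endswith(".gz") or not os.path.isfile(path):
            continue  # already compressed, or not a regular file
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)

if __name__ == "__main__":
    compress_sbt_nodes(sys.argv[1])  # e.g. .sbt.genbank-bacteria-d2-x1e5-k31
```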
all done; here are the sizes of the SBTs.
```
173M .sbt.genbank-archaea-d10-k21
102M .sbt.genbank-archaea-d10-x1e4-k21
102M .sbt.genbank-archaea-d10-x1e4-k31
102M .sbt.genbank-archaea-d10-x1e4-k51
113M .sbt.genbank-archaea-d10-x1e5-k21
113M .sbt.genbank-archaea-d10-x1e5-k31
113M .sbt.genbank-archaea-d10-x1e5-k51
173M .sbt.genbank-archaea-d10-x1e6-k21
173M .sbt.genbank-archaea-d10-x1e6-k31
174M .sbt.genbank-archaea-d10-x1e6-k51
623M .sbt.genbank-archaea-d2-k21
176M .sbt.genbank-archaea-d2-x1e4-k21
176M .sbt.genbank-archaea-d2-x1e4-k31
177M .sbt.genbank-archaea-d2-x1e4-k51
222M .sbt.genbank-archaea-d2-x1e5-k21
222M .sbt.genbank-archaea-d2-x1e5-k31
223M .sbt.genbank-archaea-d2-x1e5-k51
623M .sbt.genbank-archaea-d2-x1e6-k21
623M .sbt.genbank-archaea-d2-x1e6-k31
624M .sbt.genbank-archaea-d2-x1e6-k51
29G  .sbt.genbank-bacteria-d2-x1e5-k21
30G  .sbt.genbank-bacteria-d2-x1e5-k31
30G  .sbt.genbank-bacteria-d2-x1e5-k51
1.3G .sbt.genbank-fungi-d10-k21
1.1G .sbt.genbank-fungi-d10-x1e4-k21
1.1G .sbt.genbank-fungi-d10-x1e4-k31
1.1G .sbt.genbank-fungi-d10-x1e4-k51
1.1G .sbt.genbank-fungi-d10-x1e5-k21
1.1G .sbt.genbank-fungi-d10-x1e5-k31
1.1G .sbt.genbank-fungi-d10-x1e5-k51
1.3G .sbt.genbank-fungi-d10-x1e6-k21
1.3G .sbt.genbank-fungi-d10-x1e6-k31
1.3G .sbt.genbank-fungi-d10-x1e6-k51
2.7G .sbt.genbank-fungi-d2-k21
1.1G .sbt.genbank-fungi-d2-x1e4-k21
1.1G .sbt.genbank-fungi-d2-x1e4-k31
1.1G .sbt.genbank-fungi-d2-x1e4-k51
1.3G .sbt.genbank-fungi-d2-x1e5-k21
1.3G .sbt.genbank-fungi-d2-x1e5-k31
1.3G .sbt.genbank-fungi-d2-x1e5-k51
2.7G .sbt.genbank-fungi-d2-x1e6-k21
2.7G .sbt.genbank-fungi-d2-x1e6-k31
2.8G .sbt.genbank-fungi-d2-x1e6-k51
549M .sbt.genbank-viral-d10-k21
322M .sbt.genbank-viral-d10-x1e4-k21
328M .sbt.genbank-viral-d10-x1e4-k31
335M .sbt.genbank-viral-d10-x1e4-k51
334M .sbt.genbank-viral-d10-x1e5-k21
340M .sbt.genbank-viral-d10-x1e5-k31
348M .sbt.genbank-viral-d10-x1e5-k51
549M .sbt.genbank-viral-d10-x1e6-k21
555M .sbt.genbank-viral-d10-x1e6-k31
563M .sbt.genbank-viral-d10-x1e6-k51
2.6G .sbt.genbank-viral-d2-k21
671M .sbt.genbank-viral-d2-x1e4-k21
675M .sbt.genbank-viral-d2-x1e4-k31
684M .sbt.genbank-viral-d2-x1e4-k51
726M .sbt.genbank-viral-d2-x1e5-k21
733M .sbt.genbank-viral-d2-x1e5-k31
741M .sbt.genbank-viral-d2-x1e5-k51
2.6G .sbt.genbank-viral-d2-x1e6-k21
2.6G .sbt.genbank-viral-d2-x1e6-k31
2.6G .sbt.genbank-viral-d2-x1e6-k51
```
This is pretty out of date with the new .sbt.zip stuff. Closing as irrelevant.
A few notes from things posted to slack --

- for fungi, d2 search took 8 seconds; d10 search took 35 seconds.
- presumably this is because, when you're weeding out false hits beneath a node, you have to load an average of d/2 nodes to find the right one, or some such (see the quick check below).
- all with d2 for fungi alone.
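A back-of-the-envelope check of that d/2 intuition (my own sketch; it just assumes search time is dominated by node loads):

```python
# Expected child loads when resolving a false hit under a node with
# fanout d: checking children in order finds the right one after
# ~d/2 loads on average.
for d in (2, 10):
    print(f"d={d}: ~{d / 2:.0f} child loads per false hit")

# Ratio: (10/2) / (2/2) = 5x expected loads, in the same ballpark as
# the observed 35 s vs. 8 s (~4.4x) for the fungi searches.
```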
One of the big obstacles to using larger bloom filters here is that we want to compress the bloom filters on disk, b/c otherwise they get way too big. I assume the new buffer-based bloom filter stuff in rust allows loading from gzipped files??
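If it doesn't yet, the glue on the Python side would be small; a sketch, where `Nodegraph.from_buffer` is a hypothetical constructor for the rust-backed Bloom filter:

```python
import gzip

# `Nodegraph.from_buffer` is hypothetical; the point is just that
# gzip gets you decompressed bytes in memory, no tempfile needed.
# from sourmash.nodegraph import Nodegraph  # guessed import path

def load_nodegraph_gz(path):
    """Decompress a gzipped Bloom filter fully into memory and hand
    the raw bytes to a buffer-based constructor, skipping the
    tempfile roundtrip."""
    with gzip.open(path, "rb") as f:
        buf = f.read()
    return Nodegraph.from_buffer(buf)  # hypothetical buffer-based API
```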