Built the LCA DB for bacteria! Requires 18 GB of RAM to load, it looks like, and about 5 minutes to search for a single genome. But it's a small file at only 405MB! Yay?
```
% /usr/bin/time sourmash gather $Y outputs/lca/scaled/genbank-bacteria-k31-scaled10k.lca.json.gz

== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAADRZ010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

316.99user 35.32system 5:50.61elapsed 100%CPU (0avgtext+0avgdata 18766200maxresident)k
1227944inputs+24outputs (51major+31793920minor)pagefaults 0swaps
```
You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
```
% /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json

== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAEZMM010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

12.86user 3.81system 0:33.25elapsed 50%CPU (0avgtext+0avgdata 1064704maxresident)k
1204000inputs+583984outputs (68major+396507minor)pagefaults 0swaps
```
OTOH the SBT directory is 30GB uncompressed! So, um, ok.
> You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
>
> % /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json
> == This is sourmash version 2.0.1. ==
And you're using sourmash 2.0.1; it should be faster in newer versions.
> loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
That is a pretty small query for gather, but glad it was found quickly =]
> OTOH the SBT directory is 30GB uncompressed! So, um, ok.
Possible solutions:

- avoid the tempfile roundtrip when loading nodes (right now node data is read from `storage`, written to a tempfile, and then loaded into a Nodegraph from disk. This is a limitation in khmer, but the rust Nodegraph can be loaded from a memory buffer).

Another comment: for `search` we only need to load internal nodes once (they will never be checked again). This helps save total memory consumed, because we can unload an internal node after checking it. The `feature/unload` branch exposes this in the `find` function, but needs more tests.

This is not so useful for `gather`, because internal nodes might be checked more than once, but it might lead to a "low-memory" mode that always unloads internal node data, or a mixed approach where we cache the internal node data for frequently accessed nodes.

A dirty version of this is in the `unassigned.py` I wrote for @taylorreiter, but I would rather avoid having to dig into private fields like this and have a proper method =P
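Roughly what the unload idea looks like, as a minimal sketch; `load_data`, `unload_data`, and `matches` here are hypothetical stand-ins for the real node internals:

```python
from collections import deque

def search_unloading(root, query, threshold):
    """Breadth-first SBT search that frees each internal node's
    Bloom filter right after it is checked (fine for search, where
    no node is ever revisited)."""
    results = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node.load_data()                        # hypothetical: read Bloom filter into RAM
        match = node.matches(query, threshold)  # hypothetical containment check
        node.unload_data()                      # safe for search: never checked again
        if not match:
            continue
        if node.is_leaf():
            results.append(node)
        else:
            queue.extend(node.children)         # only descend into matching subtrees
    return results
```

For `gather` this same loop would thrash, since subtrees get revisited; the mixed approach would swap the unconditional `unload_data()` for an LRU cache over node data.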
compress internal nodes - is that Rust dependent?
> compress internal nodes - is that Rust dependent?
No, it should work now (it's a khmer feature). We need to either 1) change the indexing code to generate compressed nodes, or 2) load an SBT, compress its internal nodes, and update the .sbt.json file (make a new command to port old SBTs?)
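Option 2 could be a standalone porting script. A rough sketch, assuming the `.sbt.NAME/` storage layout from the listing below and that the loading code can read gzip-compressed node data:

```python
import gzip
import os
import shutil
import sys

def compress_sbt_nodes(storage_dir):
    """Gzip every node file in an SBT storage directory, in place.

    A sketch, not a supported sourmash command: the node filenames
    recorded in the .sbt.json would also need updating to point at
    the new .gz files."""
    for name in os.listdir(storage_dir):
        path = os.path.join(storage_dir, name)
        if name.endswith(".gz") or not os.path.isfile(path):
            continue  # already compressed, or not a regular file
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)

if __name__ == "__main__":
    compress_sbt_nodes(sys.argv[1])  # e.g. .sbt.genbank-bacteria-d2-x1e5-k31
```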
all done; here are the sizes of the SBTs.
```
173M .sbt.genbank-archaea-d10-k21
102M .sbt.genbank-archaea-d10-x1e4-k21
102M .sbt.genbank-archaea-d10-x1e4-k31
102M .sbt.genbank-archaea-d10-x1e4-k51
113M .sbt.genbank-archaea-d10-x1e5-k21
113M .sbt.genbank-archaea-d10-x1e5-k31
113M .sbt.genbank-archaea-d10-x1e5-k51
173M .sbt.genbank-archaea-d10-x1e6-k21
173M .sbt.genbank-archaea-d10-x1e6-k31
174M .sbt.genbank-archaea-d10-x1e6-k51
623M .sbt.genbank-archaea-d2-k21
176M .sbt.genbank-archaea-d2-x1e4-k21
176M .sbt.genbank-archaea-d2-x1e4-k31
177M .sbt.genbank-archaea-d2-x1e4-k51
222M .sbt.genbank-archaea-d2-x1e5-k21
222M .sbt.genbank-archaea-d2-x1e5-k31
223M .sbt.genbank-archaea-d2-x1e5-k51
623M .sbt.genbank-archaea-d2-x1e6-k21
623M .sbt.genbank-archaea-d2-x1e6-k31
624M .sbt.genbank-archaea-d2-x1e6-k51
29G  .sbt.genbank-bacteria-d2-x1e5-k21
30G  .sbt.genbank-bacteria-d2-x1e5-k31
30G  .sbt.genbank-bacteria-d2-x1e5-k51
1.3G .sbt.genbank-fungi-d10-k21
1.1G .sbt.genbank-fungi-d10-x1e4-k21
1.1G .sbt.genbank-fungi-d10-x1e4-k31
1.1G .sbt.genbank-fungi-d10-x1e4-k51
1.1G .sbt.genbank-fungi-d10-x1e5-k21
1.1G .sbt.genbank-fungi-d10-x1e5-k31
1.1G .sbt.genbank-fungi-d10-x1e5-k51
1.3G .sbt.genbank-fungi-d10-x1e6-k21
1.3G .sbt.genbank-fungi-d10-x1e6-k31
1.3G .sbt.genbank-fungi-d10-x1e6-k51
2.7G .sbt.genbank-fungi-d2-k21
1.1G .sbt.genbank-fungi-d2-x1e4-k21
1.1G .sbt.genbank-fungi-d2-x1e4-k31
1.1G .sbt.genbank-fungi-d2-x1e4-k51
1.3G .sbt.genbank-fungi-d2-x1e5-k21
1.3G .sbt.genbank-fungi-d2-x1e5-k31
1.3G .sbt.genbank-fungi-d2-x1e5-k51
2.7G .sbt.genbank-fungi-d2-x1e6-k21
2.7G .sbt.genbank-fungi-d2-x1e6-k31
2.8G .sbt.genbank-fungi-d2-x1e6-k51
549M .sbt.genbank-viral-d10-k21
322M .sbt.genbank-viral-d10-x1e4-k21
328M .sbt.genbank-viral-d10-x1e4-k31
335M .sbt.genbank-viral-d10-x1e4-k51
334M .sbt.genbank-viral-d10-x1e5-k21
340M .sbt.genbank-viral-d10-x1e5-k31
348M .sbt.genbank-viral-d10-x1e5-k51
549M .sbt.genbank-viral-d10-x1e6-k21
555M .sbt.genbank-viral-d10-x1e6-k31
563M .sbt.genbank-viral-d10-x1e6-k51
2.6G .sbt.genbank-viral-d2-k21
671M .sbt.genbank-viral-d2-x1e4-k21
675M .sbt.genbank-viral-d2-x1e4-k31
684M .sbt.genbank-viral-d2-x1e4-k51
726M .sbt.genbank-viral-d2-x1e5-k21
733M .sbt.genbank-viral-d2-x1e5-k31
741M .sbt.genbank-viral-d2-x1e5-k51
2.6G .sbt.genbank-viral-d2-x1e6-k21
2.6G .sbt.genbank-viral-d2-x1e6-k31
2.6G .sbt.genbank-viral-d2-x1e6-k51
```
This is pretty out of date with the new .sbt.zip stuff. Closing as irrelevant.
A few notes from things posted to slack --

- for fungi, d2 search took 8 seconds; d10 search took 35 seconds.
- presumably this is because, when you're weeding out false hits beneath a node, you have to load an average of d/2 nodes to find the right one, or some such (see the quick check below).
- all with d2 for fungi alone.
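A back-of-the-envelope check of that d/2 intuition (my own sketch; it just assumes search time is dominated by node loads):

```python
# Expected child loads when resolving a false hit under a node with
# fanout d: checking children in order finds the right one after
# ~d/2 loads on average.
for d in (2, 10):
    print(f"d={d}: ~{d / 2:.0f} child loads per false hit")

# Ratio: (10/2) / (2/2) = 5x expected loads, in the same ballpark as
# the observed 35 s vs. 8 s (~4.4x) for the fungi searches.
```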
One of the big obstacles to using larger bloom filters here is that we want to compress the bloom filters on disk, b/c otherwise they get way too big. I assume the new buffer-based bloom filter stuff in rust allows loading from gzipped files??
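If it doesn't yet, the glue on the Python side would be small; a sketch, where `Nodegraph.from_buffer` is a hypothetical constructor for the rust-backed Bloom filter:

```python
import gzip

# `Nodegraph.from_buffer` is hypothetical; the point is just that
# gzip gets you decompressed bytes in memory, no tempfile needed.
# from sourmash.nodegraph import Nodegraph  # guessed import path

def load_nodegraph_gz(path):
    """Decompress a gzipped Bloom filter fully into memory and hand
    the raw bytes to a buffer-based constructor, skipping the
    tempfile roundtrip."""
    with gzip.open(path, "rb") as f:
        buf = f.read()
    return Nodegraph.from_buffer(buf)  # hypothetical buffer-based API
```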