sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

subdivide sketches into equal sized chunks for better SBTs? #1832

Open ctb opened 2 years ago

ctb commented 2 years ago

Over in https://github.com/ctb/2022-sourmash-chunk-sigs, I'm trying out an indexing scheme for large databases with extremely uneven size distribution of sketches (PFAM, in this case).

It's kind of janky, but the underlying idea is sound, and it's probably something that people do more robustly with clever Bloom filter indexing schemes: subdivide each sketch into roughly equal-sized chunks and index the chunks rather than the whole sketches.

You should end up with a much more balanced SBT, and as long as you don't do any thresholding and only care about the name/filename match, this technique should work. Those are big caveats, admittedly...
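For illustration, here is a minimal pure-Python sketch of the chunking idea. It is not the code in the linked repository and does not use the sourmash API: the function names, the plain integer hash sets, and the linear scan standing in for an SBT search are all assumptions made for the example. Each sketch is cut into chunks of at most `chunk_size` hashes, each chunk keeps the name of its parent sketch, and a query is matched by name only.

```python
from typing import Dict, FrozenSet, Iterable, Iterator, List, Set, Tuple

# (parent sketch name, chunk number, hashes in this chunk)
Chunk = Tuple[str, int, FrozenSet[int]]


def chunk_hashes(name: str, hashes: Iterable[int], chunk_size: int) -> Iterator[Chunk]:
    """Split one sketch's hashes into chunks of at most chunk_size hashes."""
    ordered = sorted(set(hashes))
    for start in range(0, len(ordered), chunk_size):
        yield name, start // chunk_size, frozenset(ordered[start:start + chunk_size])


def build_chunks(sketches: Dict[str, Iterable[int]], chunk_size: int) -> List[Chunk]:
    """Chunk every sketch; these chunks would become the leaves of the index."""
    chunks: List[Chunk] = []
    for name, hashes in sketches.items():
        chunks.extend(chunk_hashes(name, hashes, chunk_size))
    return chunks


def matching_names(chunks: List[Chunk], query: Iterable[int]) -> Set[str]:
    """Return names of sketches with any chunk sharing a hash with the query.

    A linear scan stands in for the tree search here (assumption).
    No threshold is applied; only the parent name is reported.
    """
    q = set(query)
    return {name for name, _, chunk in chunks if not q.isdisjoint(chunk)}


# Hypothetical toy data: one large sketch and one small one.
sketches = {"big": range(25), "small": [100, 101, 102]}
chunks = build_chunks(sketches, chunk_size=10)
chunk_sizes = [len(c) for _, _, c in chunks]
hits = matching_names(chunks, [24, 500])
```

With a chunk size of 10, the 25-hash sketch becomes three leaves (10, 10, and 5 hashes) and the 3-hash sketch stays as one, so leaf sizes are bounded instead of spanning the full range of sketch sizes. Because a match is scored against a chunk rather than the whole sketch, containment or similarity thresholds no longer mean the same thing, which is why the example reports only names.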

ctb commented 2 years ago

For PFAM at k=10, scaled=1, with 19k sequence files, we have 427150 sketches under 10k in size. Building the SBT takes about 60 GB and 300 minutes of compute time, with 858851 total nodes to save.