sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
453 stars 78 forks source link

some more hot take ideas on making search and gather faster, for interactivity purposes #1578

Open ctb opened 3 years ago

ctb commented 3 years ago

while I'm thinking about it... it would be nice to move towards genuinely interactive search, gather,and MAGsearch.

greyhound #1226 and greyhound.sourmash.bio/ is super cool, of course, but it's "only" searching GTDB r95 and isn't well integrated with the core sourmash code base. I think oxidized LinearIndex #1526 will probably help a great deal with core sourmash, and I think that's planned for merge fairly soon (?). So what other options are there?


over in the MAGsearch blog post, I had a few suggestions for improving MAGsearch that I think apply pretty well -

A few more thoughts, in no particular order, mixing optimization tricks with sourmash internal knowledge with whatever else.

better/more functional inverted indices, in rust

something like the LCA database, in rust, would be cool. #1131 for example. I think @luizirber had some code somewhere, too, but not sure if it was being integrated into sourmash.

techniques to ameliorate load time of LCA databases

LCA databases should be pretty fast for most searches, but they're slow to load.

https://github.com/dib-lab/sourmash/issues/1484 talks about client/server search, where we could preload SBT or LCA indices, and even LinearIndexes, and query them.

we could also try out a sqlite backend for revindex/LCA, https://github.com/dib-lab/sourmash/issues/821; see #1808.

taking advantage of banding with scaled signatures

one theoretically solid approach that remains to be implemented would be to parallelize signature search across different "bands" of hash space.

e.g. take a scaled=1000 signature, produce N=10 (or 100 :) non-overlapping scaled=N*1000 signatures from it.

search those (smaller, faster) signatures against a database in parallel, and do some minimally clever sorting foo to identify signatures that collectively go over the threshold.

voila, massively parallel scaled signature search.

this is also potentially a way to make higher resolution signature search work, of course.

there are many variations on this approach that could be better, but I haven't thought through them yet.

cheap and hacky but maybe effective trick: raise the threshold for gather as quickly as possible

because of the way gather works with indexed databases, we might be able to get significant mileage out of providing smaller databases first on the command line, even if there's redundancy with later databases. this is because gather will raise the threshold for its searches quite quickly.

for example, if you put the GTDB-genomic-reps (or more generically any smaller dereplicated database) on the command line before GTDB-entire, then the search threshold for matches from GTDB entire would be raised based on the matches to GTDB-genomic-reps.

(in thinking about this, I'm actually pretty sure it will not work properly with prefetch, actually. but it might be possible to change the code to make it work better across multiple databases, or something :)

ctb commented 2 years ago

This is kind of included in the banded discussion above, but is a simpler concept to grok:

We could also start out with high scaled values that can detect big overlaps robustly; these go very fast. Then, remove those big overlaps, and lower your scaled values iteratively, and repeat.

The challenge that has prevented me from doing this so far is that I don't know exactly how to decide what size match is small enough that we should abandon the current scaled and increase the resolution. For example, if you get 5 hash matches at a scaled of 100,000, is that still good enough? Or should you then bump down to (say) 50,000 and repeat the search?

A poor/dumber version of this would be to use picklists and/or multiple databases:

Anyway, I think we have lots of mechanisms to do the iteration. Just not sure when to trigger the iteration.

ctb commented 1 week ago

this issue has aged well :)