some more hot take ideas on making search and gather faster, for interactivity purposes

while I'm thinking about it... it would be nice to move towards genuinely interactive search, gather,and MAGsearch.

greyhound #1226 and greyhound.sourmash.bio/ is super cool, of course, but it's "only" searching GTDB r95 and isn't well integrated with the core sourmash code base. I think oxidized LinearIndex #1526 will probably help a great deal with core sourmash, and I think that's planned for merge fairly soon (?). So what other options are there?

over in the MAGsearch blog post, I had a few suggestions for improving MAGsearch that I think apply pretty well -

First, the MAGsearch code doesn't do anything special in terms of loading; it's using the default sourmash signature format, which is JSON. For example, binary encodings would decrease the collection size a lot, while also speeding up search (by decreasing the load time).
Second, searching the signatures is done linearly, and uses Rust to do so in parallel. It uses the same Rust code that underlies sourmash (but is several versions behind the latest version). Making use of recent improvements in sourmash Rust code would probably speed this up several fold.

A few more thoughts, in no particular order, mixing optimization tricks with sourmash internal knowledge with whatever else.

better/more functional inverted indices, in rust

something like the LCA database, in rust, would be cool. #1131 for example. I think @luizirber had some code somewhere, too, but not sure if it was being integrated into sourmash.

techniques to ameliorate load time of LCA databases

LCA databases should be pretty fast for most searches, but they're slow to load.

https://github.com/dib-lab/sourmash/issues/1484 talks about client/server search, where we could preload SBT or LCA indices, and even LinearIndexes, and query them.

we could also try out a sqlite backend for revindex/LCA, https://github.com/dib-lab/sourmash/issues/821; see #1808.

taking advantage of banding with scaled signatures

one theoretically solid approach that remains to be implemented would be to parallelize signature search across different "bands" of hash space.

e.g. take a scaled=1000 signature, produce N=10 (or 100 :) non-overlapping scaled=N*1000 signatures from it.

search those (smaller, faster) signatures against a database in parallel, and do some minimally clever sorting foo to identify signatures that collectively go over the threshold.

voila, massively parallel scaled signature search.

this is also potentially a way to make higher resolution signature search work, of course.

there are many variations on this approach that could be better, but I haven't thought through them yet.

cheap and hacky but maybe effective trick: raise the threshold for gather as quickly as possible

because of the way gather works with indexed databases, we might be able to get significant mileage out of providing smaller databases first on the command line, even if there's redundancy with later databases. this is because gather will raise the threshold for its searches quite quickly.

for example, if you put the GTDB-genomic-reps (or more generically any smaller dereplicated database) on the command line before GTDB-entire, then the search threshold for matches from GTDB entire would be raised based on the matches to GTDB-genomic-reps.

(in thinking about this, I'm actually pretty sure it will not work properly with prefetch, actually. but it might be possible to change the code to make it work better across multiple databases, or something :)

This is kind of included in the banded discussion above, but is a simpler concept to grok:

We could also start out with high scaled values that can detect big overlaps robustly; these go very fast. Then, remove those big overlaps, and lower your scaled values iteratively, and repeat.

The challenge that has prevented me from doing this so far is that I don't know exactly how to decide what size match is small enough that we should abandon the current scaled and increase the resolution. For example, if you get 5 hash matches at a scaled of 100,000, is that still good enough? Or should you then bump down to (say) 50,000 and repeat the search?

A poor/dumber version of this would be to use picklists and/or multiple databases:

run search at scaled=100000, extract those signatures at full resolution, run them against database to remove matches
run at scaled=50000, extract at full resolution, remove matches, continue

Anyway, I think we have lots of mechanisms to do the iteration. Just not sure when to trigger the iteration.

sourmash-bio / sourmash