sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

dynamically adapt `scaled` for gather #2160

Open ctb opened 2 years ago

ctb commented 2 years ago

conversation with @bluegenes -

for `sourmash gather`, we could (a) set `scaled` automatically based on the threshold-bp (whether default or specified by user). This would be a move towards “automatic” choices that would let us do things like adjust thresholds and so on for molecule type.
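A minimal sketch of what option (a) could look like, assuming that an overlap of `threshold_bp` base pairs yields roughly `threshold_bp / scaled` hashes. The function name `scaled_for_threshold` and the parameters `min_hashes`, `min_scaled`, and `max_scaled` are hypothetical and not part of sourmash; the defaults are placeholders, not recommended values.

```python
def scaled_for_threshold(threshold_bp, min_hashes=3, min_scaled=1, max_scaled=10000):
    """Pick the largest scaled for which an overlap of threshold_bp bases
    is still expected to contribute at least min_hashes hashes.

    Hypothetical helper: assumes expected hashes ~= threshold_bp / scaled.
    """
    if threshold_bp <= 0:
        raise ValueError("threshold_bp must be positive")
    if min_hashes <= 0:
        raise ValueError("min_hashes must be positive")

    # expected hashes >= min_hashes  <=>  scaled <= threshold_bp / min_hashes
    scaled = threshold_bp // min_hashes

    # Clamp to an allowed range so tiny thresholds do not produce scaled=0
    # and huge thresholds do not produce an unusably coarse sketch.
    return max(min_scaled, min(max_scaled, scaled))


if __name__ == "__main__":
    for bp in (3000, 50000):
        print(bp, scaled_for_threshold(bp))
```

The clamp is a design choice for the sketch: without an upper bound, a very large threshold would push `scaled` so high that small genomes keep almost no hashes.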

(b) we could also support some kind of adaptive `scaled`, more generally. I’m not really sure how to do that offhand, but it seems like maybe after prefetch finds the first few matches with tons of good overlap we could say, hey, we’re going to lower scaled if you don’t mind. This could go hand in hand with adaptive sketch loading, where we only load the k-mers necessary for the specified scaled.
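To illustrate the adaptive loading idea, here is a rough sketch, assuming hashes are 64-bit values stored in sorted order and that a sketch at a given `scaled` keeps the hashes at or below about `2**64 / scaled`. The names `max_hash_for_scaled` and `load_for_scaled` are made up for this example, and the exact cutoff sourmash uses may differ slightly.

```python
from bisect import bisect_right

HASH_SPACE = 2**64  # assumed size of the 64-bit hash space


def max_hash_for_scaled(scaled):
    """Approximate cutoff: hashes at or below this value are kept."""
    if scaled < 1:
        raise ValueError("scaled must be >= 1")
    return HASH_SPACE // scaled


def load_for_scaled(sorted_hashes, scaled):
    """Return only the prefix of a sorted hash list needed for `scaled`.

    Because the list is sorted, everything past the cutoff can be skipped
    without being examined, which is the point of adaptive loading.
    """
    cutoff = max_hash_for_scaled(scaled)
    end = bisect_right(sorted_hashes, cutoff)  # index past the last hash <= cutoff
    return sorted_hashes[:end]


if __name__ == "__main__":
    hashes = sorted([10, HASH_SPACE // 2000, HASH_SPACE // 1000, HASH_SPACE // 500])
    print(len(load_for_scaled(hashes, 1000)))
    print(len(load_for_scaled(hashes, 2000)))
```

A larger `scaled` gives a smaller cutoff, so the same stored sketch can serve several resolutions by reading a shorter or longer prefix.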

ctb commented 2 years ago

banding: https://github.com/sourmash-bio/sourmash/issues/1578

adaptive thresholding: https://github.com/sourmash-bio/sourmash/issues/2145

mr-eyes commented 2 years ago

I was chatting with @luizirber, and I thought this "low-abundant kmers dynamic filter" https://github.com/onecodex/finch-rs/blob/8eb6a0c83a58ce381e736c5997f6944196bcdfe8/lib/src/filtering.rs#L155-L162 might be an extra optimization step on top of the one mentioned in this issue.