sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

move `signature::select` downsampling to `minhash()` or `get_sketch()`? #3029

Open bluegenes opened 7 months ago

bluegenes commented 7 months ago

At the moment, we use `signature::select` to downsample minhashes if needed. However, this means we load the minhash during the select (which may happen during `sig_for_dataset`, for example) and then discard it, returning just the signature. The user then has to load the minhash a second time.

We can avoid loading twice by moving downsampling into the `.minhash()` or `.get_sketch()` methods instead. What do you think @luizirber?

The downside is that there are other ways to get the minhash, e.g. `.sketches()[0]`^, and those wouldn't return the downsampled minhash even after `select` has been run on the signature.

^ I know we're thinking of deprecating `sketches`, but there are probably other ways too?
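
To make the trade-off concrete, here is a minimal toy model in Python. It is not sourmash code: `ToyMinHash`, `ToyStorage`, `EagerSignature`, `LazySignature`, `raw_sketch()` and the tiny `HASH_SPACE` are all hypothetical names made up for illustration, and the "eager" class is only a rough model of the double load described above. It counts storage loads for the two approaches and shows how a second accessor can bypass the deferred downsampling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HASH_SPACE = 1000  # toy hash space (assumption); real sketches use a far larger one


@dataclass
class ToyMinHash:
    """Hypothetical stand-in for a scaled minhash sketch."""
    scaled: int
    hashes: List[int]

    def downsample(self, scaled: int) -> "ToyMinHash":
        # Downsampling can only go to a larger scaled value (fewer hashes).
        if scaled < self.scaled:
            raise ValueError("cannot downsample to a smaller scaled value")
        cutoff = HASH_SPACE // scaled
        return ToyMinHash(scaled, [h for h in self.hashes if h < cutoff])


class ToyStorage:
    """Hypothetical backing store that counts how often a sketch is loaded."""

    def __init__(self, sketches: Dict[str, ToyMinHash]) -> None:
        self._sketches = sketches
        self.loads = 0

    def load(self, name: str) -> ToyMinHash:
        self.loads += 1
        stored = self._sketches[name]
        return ToyMinHash(stored.scaled, list(stored.hashes))


class LazySignature:
    """select() only records the request; minhash() loads once and downsamples."""

    def __init__(self, name: str, storage: ToyStorage) -> None:
        self.name = name
        self._storage = storage
        self._scaled: Optional[int] = None

    def select(self, scaled: int) -> "LazySignature":
        self._scaled = scaled
        return self

    def minhash(self) -> ToyMinHash:
        mh = self._storage.load(self.name)
        return mh if self._scaled is None else mh.downsample(self._scaled)

    def raw_sketch(self) -> ToyMinHash:
        # A second accessor that skips the deferred downsampling:
        # the toy analogue of the downside raised above.
        return self._storage.load(self.name)


class EagerSignature(LazySignature):
    """select() loads and downsamples right away, then drops the result."""

    def select(self, scaled: int) -> "EagerSignature":
        self._storage.load(self.name).downsample(scaled)  # first load, discarded
        self._scaled = scaled
        return self


def demo():
    stored = {"a": ToyMinHash(1, [5, 50, 150, 500, 999])}
    eager_store, lazy_store = ToyStorage(stored), ToyStorage(stored)
    eager = EagerSignature("a", eager_store).select(scaled=10).minhash()
    lazy = LazySignature("a", lazy_store).select(scaled=10).minhash()
    return eager, lazy, eager_store.loads, lazy_store.loads
```

In this toy, both variants return the same downsampled sketch, but the eager one hits storage twice and the lazy one once. `raw_sketch()` on a selected `LazySignature` still returns the full-resolution sketch, which is the inconsistency that any accessor other than `.minhash()` / `.get_sketch()` would have.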

bluegenes commented 7 months ago

Just remembered that `scaled` is stored in the minhash, not the signature, so we can't read it without loading the minhash. Moving downsampling out of `signature::select` would therefore mean no `scaled` selection there at all, which is probably not desirable.