sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

create a downsampled iterator so that downsampling becomes less dependent on `clone` #3355

Open ctb opened 1 month ago

ctb commented 1 month ago

per comment on #3342, https://github.com/sourmash-bio/sourmash/pull/3342#discussion_r1801549808

@luizirber speaketh:

> I wonder if we can do (in a future PR, not this one) a new `.downsampled_iter(scaled)` for operations like `count_common`, and avoid the conversion.
>
> The downsampled iter would iterate over values that are in the appropriate scaled value, but wouldn't need to create new minhash sketches (can reuse the largest one and stop returning values once they go over `max_hash`, for example)
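
To make the idea concrete, here is a rough sketch in plain Python, not sourmash code. It assumes each sketch stores its hashes as a sorted sequence of unique 64-bit integers, and that the cutoff for a given `scaled` is roughly `2**64 / scaled`. The names `max_hash_for_scaled`, `downsampled_iter`, and this free-function `count_common` are hypothetical stand-ins for illustration only; the real implementation would live in the Rust core and may look quite different.

```python
from typing import Iterable, Iterator

HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


def max_hash_for_scaled(scaled: int) -> int:
    """Hypothetical helper: approximate largest hash kept at this scaled."""
    if scaled < 1:
        raise ValueError("scaled must be >= 1")
    return HASH_SPACE // scaled


def downsampled_iter(sorted_mins: Iterable[int], scaled: int) -> Iterator[int]:
    """Yield hashes from an ascending sequence, stopping past max_hash.

    No new sketch is built; the existing hashes are reused as-is.
    """
    max_hash = max_hash_for_scaled(scaled)
    for h in sorted_mins:
        if h > max_hash:
            break  # input is sorted, so nothing later can qualify
        yield h


def count_common(mins_a, scaled_a, mins_b, scaled_b) -> int:
    """Count shared hashes at the coarser scaled via a sorted merge."""
    scaled = max(scaled_a, scaled_b)  # larger scaled keeps fewer hashes
    it_a = downsampled_iter(mins_a, scaled)
    it_b = downsampled_iter(mins_b, scaled)
    a, b = next(it_a, None), next(it_b, None)
    common = 0
    while a is not None and b is not None:
        if a == b:
            common += 1
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            a = next(it_a, None)
        else:
            b = next(it_b, None)
    return common


# Toy data: scaled=2**60 gives a cutoff of 16, scaled=2**61 a cutoff of 8.
fine = [1, 5, 8, 12, 16]
coarse = [5, 8]
shared = count_common(fine, 2**60, coarse, 2**61)
```

In this sketch neither input is copied or rebuilt: the finer sketch is simply read until its values pass the coarser cutoff, which is the property that would let `count_common` skip the downsample-and-`clone` step.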

ctb commented 1 week ago

maybe explored here? https://github.com/sourmash-bio/sourmash/pull/3394

luizirber commented 1 week ago

> maybe explored here? #3394

In a similar direction, but not quite. The downsample iter is more similar to `.iter_mins()` or `.iter_abunds()` in #3394, but there I'm only using iterators directly to calculate the intersection.