sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
476 stars 79 forks source link

`downsample_*` functions in minhash.rs _always_ downsample, even when downsampling is not necessary #3343

Closed ctb closed 1 month ago

ctb commented 1 month ago

The root cause of the 10x slowdown in the branchwater plugin (https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/463) was that downsample_scaled was being called on every against sketch, and it was creating a new sketch every time it was called - even when no downsampling needed to be done, because the scaled values match.

This is fixed in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/464 by only doing the downsampling if the scale does not match.

I think the sourmash code base behavior is somewhat unexpected and we should avoid doing expensive work when it's not needed. But I'm not sure how to modify the function signatures appropriately.

ctb commented 1 month ago

From my question on slack:

Hi all, a Rust conundrum that I don’t quite know how to resolve. Maybe writing it out will help. So, over in sourmash minhash.rs, we have a function downsample_max_hash (implemented on KmerMinHash) that currently always returns a new copy of KmerMinHash, suitably modified.

In my ideal world, we would change this function to only return a new sketch if it has to change the sketch, but would return itself (unmodified) if not - if the requested max_hash is the same as the current one, there’s no need to create a new one.

Now obviously you can’t have a function return both KmerMinHash and &KmerMinHash.

So how do I do this, structurally?

Is this a place where Clone on Write would be useful? And/or can I change the function to return just a reference, and then would I need to use lifetimes or something?

You can see what I had to do in a function that uses these functions, here:

https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/453f943351c6c702235e1b085cd04d3616b1a09a/src/manysearch.rs#L219

the call to query.inflated_abundances is the only thing that is different in the if/else statement between doing the downsampling and not - that’s because against_ds here is a new copy in the if branch, while we’re using against in the else.

ctb commented 1 month ago

luiz writes:

CoW would be a solution, but what about changing the fn sig to

    pub fn downsample_max_hash(self, max_hash: u64) -> Result<KmerMinHash, Error>

?

This avoid possible lifetime issues by giving control to whoever calls the function

then the check for downsampling or not would be done inside the downsample_max_hash function

and if nothing needs to be done, just return self

kind of how select() works https://github.com/sourmash-bio/sourmash/blob/34001e7cabfd6a9a464fc85f837751ad4746790e/src/core/src/manifest.rs#L262

ctb commented 1 month ago

resolved in https://github.com/sourmash-bio/sourmash/pull/3342