sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
476 stars 79 forks source link

MRG: improve downsampling behavior on `KmerMinHash`; fix `RevIndex::gather` bug around `scaled`. #3342

Closed ctb closed 1 month ago

ctb commented 1 month ago

This PR does five things:

First, it swaps the implementation of KmerMinHash::downsample_max_hash with KmerMinHash::downsample_scaled, and the same for KmerMinHashBTree. Previously a call to downsample_scaled calculated the right max_hash from scaled, then called downsample_max_hash, which then converted max_hash back to scaled. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.

Second, it changes the downsample_* functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodified KmerMinHash when no downsampling is needed.

Third, it turns out the downsample_* functions didn't check to make sure that the new scaled value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error, CannotUpsampleScaled, was added to sourmash core.

Fourth, this uncovered a bug in RevIndex::gather where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in the gather and calculate_gather_stats. Since RevIndex::gather isn't used in the the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/468 and https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/467, where a fastmultigather test had to be fixed because of the incorrect scaled values output by RevIndex::gather.

Fifth, it includes https://github.com/sourmash-bio/sourmash/pull/3348 from @luizirber, which adds a Signature::try_into() to KmerMinHash to support the elimination of some clones.

Because of the method signature change for the downsample_* functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.

It's been a fun journey! 😅

Fixes https://github.com/sourmash-bio/sourmash/issues/3343

Some notes on further changes and performance implications:

As a consequence of the RevIndex::gather changes, redundant downsampling has to be done in RevIndex::gather and calculate_gather_stats, unless we want to change the method signature of calculate_gather_stats. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases where scaled is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See https://github.com/sourmash-bio/sourmash/issues/3196 for an issue on all of this.

We could also just insist that the query scaled is the one to pay attention to, per https://github.com/sourmash-bio/sourmash/issues/2951. This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/430#issuecomment-2408571735 suggests that fastgather is now much more memory effficient 🎉 so that's good!

TODO:

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 69.23077% with 20 lines in your changes missing coverage. Please review.

Project coverage is 86.42%. Comparing base (1a6312b) to head (80c3404). Report is 1 commits behind head on latest.

Files with missing lines Patch % Lines
src/core/src/signature.rs 10.00% 9 Missing :warning:
src/core/src/sketch/minhash.rs 80.00% 9 Missing :warning:
src/core/src/storage/mod.rs 0.00% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## latest #3342 +/- ## ========================================== - Coverage 86.50% 86.42% -0.09% ========================================== Files 137 137 Lines 16046 16069 +23 Branches 2211 2211 ========================================== + Hits 13881 13888 +7 - Misses 1858 1874 +16 Partials 307 307 ``` | [Flag](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3342/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | Coverage Δ | | |---|---|---| | [hypothesis-py](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3342/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `25.47% <ø> (ø)` | | | [python](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3342/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `92.39% <ø> (ø)` | | | [rust](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3342/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `62.05% <69.23%> (-0.24%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ctb commented 1 month ago

Ready for review & merge @sourmash-bio/devs !!

ctb commented 1 month ago

(if & when this is merged, I will cut a new core release too.)

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #3342 will degrade performances by 16.6%

Comparing refactor_rs_downsample (80c3404) with latest (1a6312b)

Summary

❌ 4 regressions ✅ 17 untouched benchmarks

:warning: _Please fix the performance issues or acknowledge them on CodSpeed._

Benchmarks breakdown

Benchmark latest refactor_rs_downsample Change
abund0_ani_ci0 560.6 µs 672.2 µs -16.6%
abund0_ani_ci1 581.4 µs 692.9 µs -16.09%
abund1_ani_ci0 787.2 µs 898.7 µs -12.41%
abund1_ani_ci1 809.7 µs 921.2 µs -12.1%