Closed ctb closed 1 month ago
Attention: Patch coverage is 69.23077%
with 20 lines
in your changes missing coverage. Please review.
Project coverage is 86.42%. Comparing base (
1a6312b
) to head (80c3404
). Report is 1 commits behind head on latest.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Ready for review & merge @sourmash-bio/devs !!
(if & when this is merged, I will cut a new core release too.)
Comparing refactor_rs_downsample
(80c3404) with latest
(1a6312b)
❌ 4
regressions
✅ 17
untouched benchmarks
:warning: _Please fix the performance issues or acknowledge them on CodSpeed._
Benchmark | latest |
refactor_rs_downsample |
Change | |
---|---|---|---|---|
❌ | abund0_ani_ci0 |
560.6 µs | 672.2 µs | -16.6% |
❌ | abund0_ani_ci1 |
581.4 µs | 692.9 µs | -16.09% |
❌ | abund1_ani_ci0 |
787.2 µs | 898.7 µs | -12.41% |
❌ | abund1_ani_ci1 |
809.7 µs | 921.2 µs | -12.1% |
This PR does five things:
First, it swaps the implementation of
KmerMinHash::downsample_max_hash
withKmerMinHash::downsample_scaled
, and the same forKmerMinHashBTree
. Previously a call todownsample_scaled
calculated the rightmax_hash
fromscaled
, then calleddownsample_max_hash
, which then convertedmax_hash
back toscaled
. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.Second, it changes the
downsample_*
functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodifiedKmerMinHash
when no downsampling is needed.Third, it turns out the
downsample_*
functions didn't check to make sure that the newscaled
value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error,CannotUpsampleScaled
, was added to sourmash core.Fourth, this uncovered a bug in
RevIndex::gather
where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in thegather
andcalculate_gather_stats
. SinceRevIndex::gather
isn't used in the the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/468 and https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/467, where a fastmultigather test had to be fixed because of the incorrect scaled values output byRevIndex::gather
.Fifth, it includes https://github.com/sourmash-bio/sourmash/pull/3348 from @luizirber, which adds a
Signature::try_into()
toKmerMinHash
to support the elimination of some clones.Because of the method signature change for the
downsample_*
functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.It's been a fun journey! 😅
Fixes https://github.com/sourmash-bio/sourmash/issues/3343
Some notes on further changes and performance implications:
As a consequence of the
RevIndex::gather
changes, redundant downsampling has to be done inRevIndex::gather
andcalculate_gather_stats
, unless we want to change the method signature ofcalculate_gather_stats
. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases wherescaled
is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See https://github.com/sourmash-bio/sourmash/issues/3196 for an issue on all of this.We could also just insist that the query scaled is the one to pay attention to, per https://github.com/sourmash-bio/sourmash/issues/2951. This would simplify the code in Python-land as well.
Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.
Some preliminary benchmarking reported in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/430#issuecomment-2408571735 suggests that fastgather is now much more memory effficient 🎉 so that's good!
TODO:
Err
or what if the downsampling can't be performed?