sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
476 stars 79 forks source link

MRG: add `Manifest::intersect_manifest` to Rust core #3305

Closed ctb closed 2 months ago

ctb commented 3 months ago

This PR implements Manifest::intersect_manifest and Collection::intersect_manifest for the Rust layer, which is needed to support standalone manifests over in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/430.

As part of this, the PR implements Eq and Hash traits for Record so that HashSet can be used for efficient intersections.

Related PRs:

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 70.83333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 86.49%. Comparing base (26b50f3) to head (aa109f8). Report is 1 commits behind head on latest.

Files with missing lines Patch % Lines
src/core/src/manifest.rs 72.72% 6 Missing :warning:
src/core/src/collection.rs 50.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## latest #3305 +/- ## ========================================== - Coverage 86.53% 86.49% -0.04% ========================================== Files 137 137 Lines 16023 16047 +24 Branches 2728 2728 ========================================== + Hits 13865 13880 +15 - Misses 1849 1858 +9 Partials 309 309 ``` | [Flag](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3305/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | Coverage Δ | | |---|---|---| | [hypothesis-py](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3305/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `25.40% <ø> (ø)` | | | [python](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3305/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `92.37% <ø> (ø)` | | | [rust](https://app.codecov.io/gh/sourmash-bio/sourmash/pull/3305/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio) | `62.30% <70.83%> (+<0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sourmash-bio#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ctb commented 2 months ago

@luizirber @bluegenes I've got this down to something that's pleasingly simple and I'm curious what you think.

The underlying requirement is that when loading a standalone manifest, we want to intersect the new manifest with the old so as to allow subsetting of large collections - this is one of the key behaviors needed for branchwater.

In this PR, I've implemented an intersect_manifest method on Collection and Manifest that provides this behavior, and it seems to be working as intended over in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/430 🎉 .

The main remaining question is, how do we decide if two rows are equal? For now, I am using a (name, md5) tuple to do this, which is how sourmash does it (but not for good reasons, so we could change it in the Rust code). We could switch to providing Hash and Eq traits, which would make HashSet work directly on Records and make everything simpler.

However, right now the Eq derivation on struct Record compares all fields for equality link. I would like to remove internal_location from this comparison to support standalone manifests better; I think this is all that's needed.

Thoughts? Concerns?

ctb commented 2 months ago

@luizirber @bluegenes bumping previous comment - the branchwater code is coming along nicely, but it won't really be mergeable until we figure out the sourmash side of things. Thanks!

bluegenes commented 2 months ago

The main remaining question is, how do we decide if two rows are equal? For now, I am using a (name, md5) tuple to do this, which is how sourmash does it (but not for good reasons, so we could change it in the Rust code). We could switch to providing Hash and Eq traits, which would make HashSet work directly on Records and make everything simpler.

However, right now the Eq derivation on struct Record compares all fields for equality link. I would like to remove internal_location from this comparison to support standalone manifests better; I think this is all that's needed.

I agree, Hash + Eq/Partial_Eq would be great here - better/more standardized to automatically check all the relevant info rather than just checking to name, md5sum.

ctb commented 2 months ago

@luizirber @bluegenes ready for review!

ctb commented 2 months ago

@luizirber any strong objections to the code here?

ctb commented 2 months ago

@luizirber @bluegenes can we merge this? :)

ctb commented 2 months ago

(not sure why codecov is acting up... tests do cover that code?!)

bluegenes commented 2 months ago

I'm ok with merging 🤷🏼‍♀️

luizirber commented 2 months ago

(not sure why codecov is acting up... tests do cover that code?!)

Sometimes cargo-tarpaulin doesn't get everything, I've tried #2993 some time ago, maybe time to revisit

ctb commented 2 months ago

🎉