onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License
92 stars 8 forks source link

Make raw_distance function commutative #47

Closed maxbla closed 4 years ago

maxbla commented 4 years ago

I added a property test that raw_distance(sequence1, sequence2, 0.) == raw_distance(sequence2, sequence1, 0.), and it failed at first. Then I refactored the raw_distance function to have broadly similar behavior (still passes all tests), but use more idiomatic Rust, and be shorter and simpler. In doing so, I managed to make it pass the commutativity test. This doesn't exactly fix a problem (see below) so I'll understand if it is not merged.

On what case did the old raw_distance function fail?

raw_distance(&[2, 0], &[1], 0.) == (0.0, NaN, 0, 0) and raw_distance(&[1], &[2, 0], 0.) == (0.0, NaN, 0, 0). Since Nan != Nan this technically fails the commutativity test, although the practical use of that result is questionable. Also I think raw_distance is supposed to take only sorted slices as inputs (but couldn't find any documentation saying so), so the commutativity proptest might be completely invalid.

bovee commented 4 years ago

A few drive-by notes: raw_distance should only ever take sorted slices as inputs; that might be worth adding to the doc string in case it's not clear. I'm not sure NaN is necessarily wrong there, but 1 is probably better since that matches intuition (the other alternative is to throw an error, but that seems like overkill and would change the method signature).

But yeah, I think the changes make the code a lot simpler to read and the test cases handle all the regressions/weird edge cases we were worried about in 4af8ad9986bdb58b8ee2b6c96c32b911df71ad07 so this should be safe to merge.

Keats commented 4 years ago

Great @bovee I was about to ping you since I wasn't sure whether it was meant to take sorted slices! How do you feel about returning a struct rather than a tuple of 4 values?

maxbla commented 4 years ago

I think a struct (maybe called SetStastics?) would be more readable, and I like the idea. The only issue is that raw_distance is a pub fn, meaning any change to it is a breaking change (so it would also need a semver increment). I don't know what other code at One Cocex depends on finch-rs, and suspected a breaking change might, well, break it. It might make more sense to merge this PR as-is for now, and make multiple breaking changes all at once.

Keats commented 4 years ago

We rely on crates.io internally for this crate so it could be worth a 0.4 Have you seen other potential breaking changes?

maxbla commented 4 years ago

We rely on crates.io internally for this crate so it could be worth a 0.4

Good to know. In that case, a breaking change with a version bump wouldn't break properly versioned reverse dependencies.

Have you seen other potential breaking changes?

As this PR currently stands, there are no breaking changes. I would like to close #48, which would could be rolled up into a 0.4 release. I would need help to do so though because I can't create repositories in the onecodex github orginization.

maxbla commented 4 years ago

Is this ready for merge, or should I add a SetStastics struct?

Keats commented 4 years ago

Sorry, I lost track of that issue.