sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

fix `MinHash.merge(...)` in rust #1446

Open ctb opened 3 years ago

ctb commented 3 years ago

In https://github.com/dib-lab/sourmash/pull/1395, I detail some fun ways to make the rust code return invalid MinHash objects because merge does Bad Things when handed mixed pairs of abundance/non-abundance sketches.

This is an issue corresponding to that PR.

ctb commented 3 years ago

As part of the larger "What shall we do with MinHash objects?" discussion, note that here we want to come up with tests and (hopefully simple) fixes so that things don't break. Down the road we might want to also think about splitting abund and noabund MinHash objects so that an explicit conversion from abund to noabund is needed, similarly to what we are talking about with num and scaled MinHash objects in https://github.com/dib-lab/sourmash/issues/1354#issuecomment-801267393.

keyabarve commented 3 years ago

@ctb I'm a little confused about what exactly needs to be done here. Could you please clarify?

ctb commented 3 years ago

hi @keyabarve the most straightforward option is to take over #1395 and fix all of the issues - basically, do systematic tests of all of the if/else branches in the Rust merge code starting in Python land, and then fix the Rust merge code to pass all of the tests. Does that make sense?

ctb commented 3 years ago

to get started, try checking out that branch with git fetch; git checkout fix/minhash_abund_merge and then do a trial removal of all of the changes to src/core/src/sketch/minhash.rs to the latest branch with git show latest:src/core/src/sketch/minhash.rs > src/core/src/sketch/minhash.rs. At that point you should see that the tests in tests/test__minhash.py don't work any more!

So your task would be to fix the code in src/core/src/sketch/minhash.rs to pass those tests, as well as seeing if there are other problems with the merge code that need fixing (as well as tests).

Note that you can get back the right version of src/core/src/sketch/minhash.rs on this branch with

git checkout src/core/src/sketch/minhash.rs