sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

update 'num' behavior, in terms of downsampling and comparison checking #870

Open ctb opened 4 years ago

ctb commented 4 years ago

see https://github.com/dib-lab/sourmash/pull/856#issuecomment-578526048 - there,

ctb

in src/core/src/sketch/minhash.rs, the if ignore_abundance code in similarity seems duplicative with the compare function; this should be resolved in some way, or at least commented. is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

luizirber:

These two are related: when I made #808 I had SourmashError::MismatchNum in check_compatible and merged similarity and compare, but that broke the code because check_compatible was not used in any of the scaled/downsampled parts of the code (which obviously don't have the same num.

ctb commented 4 years ago

well, good news is that when you add MismatchNum checking in check_compatible, you only break 6 tests, all in test__minhash.py. Not sure if that's because we're not testing different 'num' minhashes from the command line, or what!

ctb commented 4 years ago

...looks like we're not testing it via the command line 😆

ctb commented 2 years ago

https://github.com/sourmash-bio/sourmash/pull/1955 adds in some code for this downsampling.