sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

settle on core MinHash object API #885

Open ctb opened 4 years ago

ctb commented 4 years ago

PR https://github.com/dib-lab/sourmash/pull/882 raised a question in my mind about the various similarity and comparison methods on MinHash objects. We have similarity, compare, and jaccard, as well as contained_by and containment_ignore_maxhash. Currently compare and jaccard are identical (and do Jaccard comparisons without hash abundances), while similarity calculates Jaccard similarity w/o abundances and calculates angular similarity with abundances.

The source code for this is a bit of a mess, obviously :).

SEPARATELY, I just realized that the sourmash gather code in sourmash/search.py does abundance calculations in Yet Another Way, with what I'll call a projection from an abundance-weighted signature on to a non-abundance-weighted signature (see block in search.py). This code is entirely absent from the MinHash code. sigh.

Anyway.

882 makes the rust code a bit simpler and less repetitive, and also IMO clearer, by adding explicit angular_similarity and jaccard functions (that do those calculations without downsampling), and then doing the downsampling and so on in similarity and compare. This raises the following questions --

Should the Python API match the rust API? (probably yes - right now this means adding angular_similarity.)

should we get rid of redundant functions (esp ones that no one uses outside of the internal sourmash team :) and document the official API? (probably yes.)

should we put "buyer beware" in around things like angular similarity and projection similarity that we haven't proven work, with either simulations or analytics? (...probably yes.)

related to #866 #624

ctb commented 4 years ago

@luizirber what's the right way to use deprecated to indicate that the Python MinHash::compare function will be removed in the future? And when's the earliest we can remove it - v4, or v5?

ctb commented 4 years ago

I'm going to try to follow the pandas deprecation policy (also ref their versioning policy)

ctb commented 4 years ago

We should write Python MinHash object API tests/docs as part of this, esp now that #889 is merged and life is looking a bit better.

luizirber commented 4 years ago

We should write Python MinHash object API tests/docs as part of this, esp now that #889 is merged and life is looking a bit better.

And probably use similar docs for the rust docs, they are pretty empty right now...

ctb commented 4 years ago

would also be good to revisit issues like https://github.com/dib-lab/sourmash/issues/618 to see that we're at least consistent.

ctb commented 3 years ago

random side thought: I think we should redefine the MinHash constructor API like so:

def __init__(self, *, num=0, scaled=0, ksize=None, ...)

(ref #338 of course)