sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Properly document/warn users about ANI b/t translated sequences and protein. #2010

Open ctb opened 2 years ago

ctb commented 2 years ago

tl;dr Yay ANI! https://github.com/sourmash-bio/sourmash/pull/1967 Boo ANI on translated sequences unless max_containment used :(.

Longer backstory:

For translated DNA x protein, we will have many spurious proteins (unless we use orpheum :)).

Thus the Jaccard will be very different from the containment which will be very different from the max containment.

Thus the Jaccard ANI will be very different from the containment ANI which will be very different from the max containment ANI.

(It is even worse for translated DNA x translated DNA, where there is no solution, I think.)
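To make the size of the effect concrete, here is a toy sketch using plain Python sets rather than real sourmash sketches. The hash values, the set sizes, and `ksize = 10` are all made up for illustration. The similarity-to-identity conversions are simple point estimates assumed for this example, and may not match exactly what sourmash computes.

```python
def jaccard(a, b):
    """Shared hashes divided by all distinct hashes in either set."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of the hashes in `a` that are also present in `b`."""
    return len(a & b) / len(a)


def max_containment(a, b):
    """Containment measured relative to the smaller of the two sets."""
    return len(a & b) / min(len(a), len(b))


def containment_to_identity(c, ksize):
    """Assumed point estimate: identity ~ c ** (1 / ksize)."""
    return c ** (1.0 / ksize)


def jaccard_to_identity(j, ksize):
    """Assumed point estimate: identity ~ (2j / (1 + j)) ** (1 / ksize)."""
    return (2.0 * j / (1.0 + j)) ** (1.0 / ksize)


# Toy data (made up): a protein sketch with 100 hashes, and a 6-frame
# translated sketch that recovers 95 of them plus 500 spurious hashes
# coming from the incorrect reading frames.
protein = set(range(100))
translated = set(range(95)) | set(range(1000, 1500))

ksize = 10  # illustrative protein k-mer size, an assumption

j = jaccard(translated, protein)
c = containment(translated, protein)  # diluted by the spurious hashes
mc = max_containment(translated, protein)  # relative to the protein sketch

print(f"Jaccard         {j:.3f} -> identity {jaccard_to_identity(j, ksize):.3f}")
print(f"containment     {c:.3f} -> identity {containment_to_identity(c, ksize):.3f}")
print(f"max containment {mc:.3f} -> identity {containment_to_identity(mc, ksize):.3f}")
```

In this toy case the spurious hashes inflate both the union and the translated sketch, so Jaccard and the containment of the translated sketch both collapse, while max containment is measured against the smaller protein sketch and stays high. That only holds while the translated sketch is the larger of the two.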

@bluegenes sez:

This is the optimal use case for --max-containment, as long as the 6-frame translated sketch is larger than the protein sketch it is being compared with (Jaccard ANI, and ANI from the containment of the 6-frame translated sketch, will be wrong). 6-frame x 6-frame translated ANI will also be wrong. Perhaps we need a good bit of documentation on this, or a warning about using translated sketches.

bluegenes commented 2 years ago

Just a thought --

Beyond general warnings/documentation, we could also warn about this (& prevent translate x translate ANI) if we stored a property in the MinHash class that described whether or not the sketch was generated via translation. ref #268
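A rough sketch of what that could look like, using a stand-in class rather than the real MinHash. Everything here is hypothetical: `ToySketch`, the `translated` attribute, and `max_containment_checked` are invented names for illustration and are not part of sourmash.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical stand-in for a sketch; not the sourmash MinHash class."""
    hashes: frozenset
    translated: bool = False  # hypothetical flag: built by 6-frame translation?


def max_containment_checked(a, b):
    """Max containment that refuses translated x translated comparisons
    and warns when exactly one side is a translated sketch."""
    if a.translated and b.translated:
        raise ValueError(
            "both sketches were generated via translation; "
            "ANI between them is not reliable"
        )
    if a.translated or b.translated:
        warnings.warn(
            "one sketch was generated via translation; only max containment "
            "is expected to be meaningful, and only if the translated "
            "sketch is the larger one",
            UserWarning,
        )
    if not a.hashes or not b.hashes:
        return 0.0  # avoid dividing by zero on empty sketches
    shared = len(a.hashes & b.hashes)
    return shared / min(len(a.hashes), len(b.hashes))


# Toy usage with made-up hash values.
protein = ToySketch(frozenset(range(100)))
six_frame = ToySketch(
    frozenset(range(95)) | frozenset(range(1000, 1500)), translated=True
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = max_containment_checked(six_frame, protein)

print(value, [str(w.message) for w in caught])
```

Raising on translated x translated and only warning on translated x protein mirrors the distinction made in this thread. Whether an error or a warning is the right default would be up to the maintainers.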