sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

some questions about ANI and `compare` #2226

Open ctb opened 2 years ago

ctb commented 2 years ago

while working on https://github.com/sourmash-bio/sourmash/pull/2225 and making sure that plot will work with both similarity and distance matrices, I ran a few comparisons with the 64 genomes from podar-ref (sequences here).

When I run:

sourmash compare -o cmp podar-ref/*.fa.sig -k 21 
sourmash plot cmp

I get:

[Screenshot: `sourmash plot` output for the comparison above, 2022-08-20 10:59:28 AM]

good so far!

when I add ANI,

sourmash compare -o cmp podar-ref/*.fa.sig -k 21 --ani
sourmash plot cmp

I get:

[Screenshot: `sourmash plot` output for the `--ani` comparison above, 2022-08-20 11:00:25 AM]

which is a lot busier! I think this reflects the fact that translating to ANI involves a logarithmic-style transformation of the Jaccard, so even really low similarities pop into view once they are expressed as ANI.

Is this an ok way to think about things, @bluegenes @dkoslicki? Is there more going on?
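To make the "low similarities pop into view" effect concrete, here is a minimal sketch. It assumes the point estimate ANI ≈ (2J / (1 + J))^(1/k) for a Jaccard value J; that formula and the helper name `ani_from_jaccard` are assumptions made for illustration, not something stated in this thread, and the sketch ignores confidence intervals and sketch-size effects.

```python
def ani_from_jaccard(jaccard: float, ksize: int) -> float:
    """Assumed point estimate of ANI from a Jaccard similarity.

    Hypothetical helper for illustration only; it is not a sourmash API.
    """
    if jaccard <= 0.0:
        # No shared k-mers: report 0 rather than taking a root of zero.
        return 0.0
    # Convert Jaccard to a containment-like value, then take the k-th root.
    return (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / ksize)


if __name__ == "__main__":
    ksize = 21  # same k-mer size as the compare commands above
    for j in (0.001, 0.01, 0.1, 0.5, 1.0):
        print(f"Jaccard {j:<6} -> estimated ANI {ani_from_jaccard(j, ksize):.3f}")
```

Under this assumption, Jaccard values spanning three orders of magnitude are squeezed into a much narrower ANI range, which would explain why cells that look empty in the similarity plot become visible in the ANI plot.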

dkoslicki commented 2 years ago

Yup, that's what I would suspect too. I can't seem to find it now, but I recall that plot you had of containment vs k-mer size and ANI vs k-mer size: while the similarity decreases rapidly as k grows, the ANI stays about the same. The ANI = containment^(1/k) relationship appears to be causing this when k is of any appreciable size.
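A small sketch of that behaviour, assuming the point estimate ANI ≈ containment^(1/k), so that the expected containment for a given ANI is roughly ANI^k. The helper names and the example ANI of 0.95 are hypothetical choices for illustration; this ignores the variance of a real sketch-based estimate.

```python
def containment_from_ani(ani: float, ksize: int) -> float:
    """Expected containment if every base matches with probability `ani`
    (assumed simple mutation model; hypothetical helper)."""
    return ani ** ksize


def ani_from_containment(containment: float, ksize: int) -> float:
    """Invert the relationship above: ANI ~ containment^(1/k)."""
    return containment ** (1.0 / ksize)


if __name__ == "__main__":
    true_ani = 0.95  # illustrative value, not taken from the podar-ref data
    for ksize in (21, 31, 51):
        c = containment_from_ani(true_ani, ksize)
        recovered = ani_from_containment(c, ksize)
        print(f"k={ksize}: containment {c:.4f} -> ANI {recovered:.4f}")
```

The containment column shrinks quickly as k increases, while the recovered ANI stays at the value it started from, which is the pattern described for the containment-vs-k and ANI-vs-k plot.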

ctb commented 2 years ago

thanks!

I think you're referring to this issue and graph - https://github.com/sourmash-bio/sourmash/issues/267#issuecomment-1120024387

dkoslicki commented 2 years ago

Yup, that's the one!

dkoslicki commented 2 years ago

From a philosophical perspective, I think this is good evidence that ANI is more interpretable than containment/similarity. How large a containment value has to be before it counts as "big" depends quite a bit on the k-size. So even though the first plot (sourmash compare -o cmp podar-ref/*.fa.sig -k 21) makes it seem like all the organisms are distinct-ish, the containment values of 0.001-or-whatever are actually "significant".

ctb commented 2 years ago

thanks, that was my intuition as well! better representing small values in comparison matrices came up over in https://github.com/sourmash-bio/sourmash/issues/491, too, although I think the interpretation of ANI-between-metagenomes is ...tricky ;).

I'll leave this open for a bit so others can comment, and close it next time I do an issues sweep!