Open ctb opened 2 years ago
Yup, that's what I would suspect too. I can't seem to find it now, but I recall that plot you had of containment vs k-mer size and ANI vs k-mer size: while the similarity decreases rapidly, the ANI stays about the same. The ANI = containment^(1/k) relationship (so 1-ANI = 1 - containment^(1/k)) appears to be causing this when k is any appreciable size.
thanks!
I think you're referring to this issue and graph - https://github.com/sourmash-bio/sourmash/issues/267#issuecomment-1120024387
Yup, that's the one!
From a philosophical perspective, I think this is good evidence that ANI is more interpretable than containment/similarity. How large of a containment value is considered "big" depends quite a bit on the k-size. So even though the first plot (sourmash compare -o cmp podar-ref/*.fa.sig -k 21) makes it seem like all the organisms are distinct-ish, the containment values of 0.001whatever are actually "significant".
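To make the "0.001whatever is significant" point concrete, here is a minimal sketch of the point estimate ANI ≈ containment^(1/k). The function name `containment_to_ani` is hypothetical and is not sourmash's API; the sketch also ignores confidence intervals and any scaled/FracMinHash corrections, so treat the numbers as rough illustrations only.

```python
def containment_to_ani(containment: float, ksize: int) -> float:
    """Rough point estimate: ANI ~= containment ** (1 / ksize).

    Hypothetical helper for illustration; not the sourmash implementation.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


if __name__ == "__main__":
    # The same small containment values map to fairly high ANI estimates,
    # and the effect gets stronger as k grows.
    for ksize in (21, 31, 51):
        for containment in (0.001, 0.01, 0.1, 0.5):
            ani = containment_to_ani(containment, ksize)
            print(f"k={ksize:2d}  containment={containment:<5}  ANI~{ani:.3f}")
```

Under this simplified model, a containment of 0.001 at k=21 already corresponds to an ANI estimate of roughly 0.72, which is why values that look negligible in a containment matrix show up clearly once converted to ANI.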
thanks, that was my intuition as well! better representing small values in comparison matrices came up over in https://github.com/sourmash-bio/sourmash/issues/491, too, although I think the interpretation of ANI-between-metagenomes is ...tricky ;).
I'll leave this open for a bit so others can comment, and close it next time I do an issues sweep!
while working on https://github.com/sourmash-bio/sourmash/pull/2225 and making sure that plot will work with both similarity and distance matrices, I ran a few comparisons with the 64 genomes from podar-ref (sequences here). When I run:
I get:
good so far!
when I add ANI,
I get:
which is a lot busier! I think this reflects the fact that translation to ANI involves a logarithmic transformation step from the Jaccard, and so even really low similarities etc pop up into view with the ANI.
Is this an ok way to think about things, @bluegenes @dkoslicki? Is there more going on?