sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

ANI calculation #3170

Open Amanda-Biocortex opened 6 months ago

Amanda-Biocortex commented 6 months ago

Hi,

Could you help me to understand how QueryContainmentAni and MatchContainmentAni are calculated?

Given the use of exact kmer matches, I would assume that the ANI between a query kmer and a reference kmer would be 100%? Or is the ANI calculated between the contiguous set of kmers and the reference?

I believe 95% ANI threshold is standard - would this be the same for Sourmash ANI?

Many thanks, Amanda

ctb commented 6 months ago

hi @Amanda-Biocortex, the calculation is published here:

Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash https://genome.cshlp.org/content/33/7/1061

(preprint here: https://www.biorxiv.org/content/10.1101/2022.01.11.475870v4)

My recollection is that the calculation is based on the decay in the fraction of k-mers that match as sequences diverge.
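
To make that decay concrete: under a simple model where each base mutates independently, a k-mer survives unchanged with probability ANI^k, so the containment C between two sketches falls off roughly as ANI^k, and a point estimate is ANI ≈ C^(1/k). The sketch below illustrates only that point estimate. The function name `containment_to_ani` is made up for this example and is not sourmash's API, and the confidence intervals derived in the paper are left out.

```python
def containment_to_ani(containment: float, ksize: int) -> float:
    """Point estimate of ANI from k-mer containment (hypothetical helper).

    Assumes the simple mutation model in which the expected containment
    is ANI ** ksize, so ANI is the ksize-th root of the containment.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


# Exact k-mer matching does not force ANI to 100%: what matters is the
# *fraction* of k-mers that match. At k=31, two genomes at 95% ANI are
# expected to share only about a fifth of their k-mers.
expected_containment = 0.95 ** 31
estimated_ani = containment_to_ani(expected_containment, ksize=31)
print(f"containment={expected_containment:.3f} -> ANI={estimated_ani:.3f}")
```

So each individual matching k-mer is an exact match, but the ANI estimate comes from how many k-mers fail to match across the whole sketch.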

ctb commented 6 months ago

I believe 95% ANI threshold is standard - would this be the same for Sourmash ANI?

95% is usually used for species cutoffs between two genomes. sourmash's containment ANI is (should be) directly relatable to alignment-based ANI. So, yes? :)

If you're comparing a genome to a metagenome containing multiple strains, then I think things get more complicated and interesting - it would be like you are aligning the reads to both genomes, and then calculating the best ANI match at each location, I think.
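
One reason the genome-versus-metagenome case differs is that containment is asymmetric: the fraction of the genome's k-mers found in the metagenome is usually much larger than the fraction of the metagenome's k-mers found in the genome. The following is a toy sketch of that asymmetry using plain Python sets of integers as stand-ins for sketch hashes. The helper names and the toy data are invented for illustration, and how the two directions map onto the QueryContainmentAni and MatchContainmentAni columns is an assumption that should be checked against the sourmash documentation.

```python
def containment(a: set, b: set) -> float:
    """Fraction of the hashes in `a` that are also present in `b`."""
    if not a:
        raise ValueError("cannot compute containment of an empty set")
    return len(a & b) / len(a)


def ani_from_containment(c: float, ksize: int) -> float:
    """Point estimate ANI = C ** (1 / k), assuming a simple mutation model."""
    return c ** (1.0 / ksize)


KSIZE = 31

# Toy stand-ins for hash sets: a small genome and a much larger metagenome
# that holds 800 of the genome's 1000 hashes plus many unrelated ones.
genome = set(range(0, 1000))
metagenome = set(range(200, 5000))

genome_in_metagenome = containment(genome, metagenome)   # 800 / 1000
metagenome_in_genome = containment(metagenome, genome)   # 800 / 4800

ani_genome_side = ani_from_containment(genome_in_metagenome, KSIZE)
ani_metagenome_side = ani_from_containment(metagenome_in_genome, KSIZE)

print(f"genome in metagenome: C={genome_in_metagenome:.3f}, ANI={ani_genome_side:.4f}")
print(f"metagenome in genome: C={metagenome_in_genome:.3f}, ANI={ani_metagenome_side:.4f}")
```

In this toy setup the genome-side estimate is the one that says something about how well the genome is represented in the sample; the metagenome-side value is pulled down by everything else the metagenome contains.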