with sourmash v4.4, sourmash can now estimate Average Nucleotide Identity between genomes: this is the fraction of bases that would be identical in a pairwise alignment. However, sourmash estimates this based on k-mers, instead of using alignments. This is fast and lightweight, and doesn't need access to the full genomes!
Let's use ANI with sourmash compare and sourmash plot to look at the relationship between 12 E. coli genomes.
ANI estimates are somewhat dependent on the software and parameters used to calculate them. We are working on a systematic comparison of sourmash's ANI estimates with other ANI software!
sourmash ANI estimates are only available for scaled signatures (the default, when signatures are generated with sourmash sketch).
In order to accurately estimate ANI, sourmash signatures need to have enough hashes for the calculation; this is dependent on both the size of the genome(s) and the scaled factor used to generate the signatures. sourmash will output warnings to stderr when the sketches are too small to accurately estimate ANI.
@ctb I'm looking into some strange discrepancies between the different ways ANI is being calculated (see here). But in spirit, I think this is a great example!
with sourmash v4.4, sourmash can now estimate Average Nucleotide Identity between genomes: this is the fraction of bases that would be identical in a pairwise alignment. However, sourmash estimates this based on k-mers, instead of using alignments. This is fast and lightweight, and doesn't need access to the full genomes!
Let's use ANI with
sourmash compare
andsourmash plot
to look at the relationship between 12 E. coli genomes.First, you'll need to download the GTDB genomic-reps database, containing 66k genomes, as in https://github.com/sourmash-bio/sourmash-examples/issues/13.
Then, run
compare
, selecting just theEscherichia
genomes:This creates a numpy comparison matrix in
ecoli-ani.cmp
(you can also generate a CSV output with--csv
).Now, use
plot
to quickly visualize the comparison matrix:This will produce an image
ecoli-ani.cmp/matrix.png
:This plot shows the clade structure of the 12
E. coli
genomes, including three that have no detectable ANI similarity to any other ANI genomes.The genome names corresponding to the label numbers can be found in the
ecoli-ani.cmp.labels.txt
file.Caveats and details
ANI can be estimated by sourmash from ~85% to 99%, depending on the k-mer size used. See https://github.com/sourmash-bio/sourmash/issues/1859 for some numbers for k=21 and k=31.
ANI estimates are somewhat dependent on the software and parameters used to calculate them. We are working on a systematic comparison of sourmash's ANI estimates with other ANI software!
sourmash ANI estimates are only available for scaled signatures (the default, when signatures are generated with
sourmash sketch
).In order to accurately estimate ANI, sourmash signatures need to have enough hashes for the calculation; this is dependent on both the size of the genome(s) and the scaled factor used to generate the signatures. sourmash will output warnings to stderr when the sketches are too small to accurately estimate ANI.
Some details on how sourmash estimates ANI are in the sourmash docs.
Citations
The analytical work underlying the ANI calculations is introduced in Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances, Hera et al., 2022.