sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
455 stars 78 forks source link

FracMinHash containment to ANI conversion #1859

Open bluegenes opened 2 years ago

bluegenes commented 2 years ago

ref ANI estimation PR #1788

I've been using our forthcoming ANI utilities to estimate pairwise ANI between GTDB genomes. From these data, we can examine the average containment --> ANI relationship for a given kmer length. Note that since the number of unique k-mers in each comparison also impacts the ANI estimate, so we do not expect a single ANI value per containment value.

I've currently run family-level comparisons using k=21, scaled=1000. I'm using the average of the directional containment values ("average containment") to estimate ANI. Average containment for these comparisons ranges from 0-1, and ANI estimates range from 80%-100%.

image

@ctb suggested binning containment values so we can develop a feel for containment --> ANI. Here I've binned containment by 0.05 intervals (containment ranges from 0-1).

Note: count is the total number of pairwise genome comparisons in each bin. image

csv version of this table attached. mean-containment-k21-to-ANI.csv

ctb commented 2 years ago

and here's some code:

import csv

class ContainmentToANI_Converter:
    def __init__(self, tablefile):
        with open(tablefile, newline="") as fp:
            r = csv.DictReader(fp)
            vals = []
            for row in r:
                highbound = float(row['highContainment'])
                minANI = float(row['minANI'])
                vals.append((highbound, minANI))
            self.vals = vals

    def convert_containment(self, c):
        biggest_cont = 0
        biggest_ani = 0
        for cont, ani in self.vals:
            if c >= cont and cont > biggest_cont:
                print(c, cont, biggest_cont, biggest_ani)
                biggest_cont = cont
                biggest_ani = ani
        return biggest_ani

It relies on having a column highContainment that contains the right side of the interval in @bluegenes CSV file, so you'll need to create that (or we can beg @bluegenes to make it for us with her next export :).

Outputs:

assert x.convert_containment(0.5) == 0.95
assert x.convert_containment(0.05) == 0.8
ctb commented 2 years ago

(here's the file for k=21 suitably modified for the code above) mean-containment-k21-to-ANI.csv

bluegenes commented 2 years ago

k31, scaled 1000

These are results from gtdb representatives vs. all of gtdb. The counts are a little misleading, as there are certainly some duplicates (sigA --> sigB and sigB --> sig A are currently counted independently, will fix later). This does not affect binning, since we're just binning by 0.5 containment increments.

Note that k21 ANI values are closer to mapping-based ANI; k31 is a bit more sensitive.

gtdb-rs202 nucleotide-k31-scaled1000 containmentANI

image

mean-containment-k31-to-ANI.csv

ctb commented 2 years ago

can / should this issue be closed? @bluegenes