Open bluegenes opened 2 years ago
and here's some code:
import csv
class ContainmentToANI_Converter:
def __init__(self, tablefile):
with open(tablefile, newline="") as fp:
r = csv.DictReader(fp)
vals = []
for row in r:
highbound = float(row['highContainment'])
minANI = float(row['minANI'])
vals.append((highbound, minANI))
self.vals = vals
def convert_containment(self, c):
biggest_cont = 0
biggest_ani = 0
for cont, ani in self.vals:
if c >= cont and cont > biggest_cont:
print(c, cont, biggest_cont, biggest_ani)
biggest_cont = cont
biggest_ani = ani
return biggest_ani
It relies on having a column highContainment
that contains the right side of the interval in @bluegenes CSV file, so you'll need to create that (or we can beg @bluegenes to make it for us with her next export :).
Outputs:
assert x.convert_containment(0.5) == 0.95
assert x.convert_containment(0.05) == 0.8
(here's the file for k=21 suitably modified for the code above) mean-containment-k21-to-ANI.csv
k31, scaled 1000
These are results from gtdb representatives vs. all of gtdb. The counts are a little misleading, as there are certainly some duplicates (sigA --> sigB and sigB --> sig A are currently counted independently, will fix later). This does not affect binning, since we're just binning by 0.5 containment increments.
Note that k21 ANI values are closer to mapping-based ANI; k31 is a bit more sensitive.
can / should this issue be closed? @bluegenes
ref ANI estimation PR #1788
I've been using our forthcoming ANI utilities to estimate pairwise ANI between GTDB genomes. From these data, we can examine the average containment --> ANI relationship for a given kmer length. Note that since the number of unique k-mers in each comparison also impacts the ANI estimate, so we do not expect a single ANI value per containment value.
I've currently run family-level comparisons using k=21, scaled=1000. I'm using the average of the directional containment values ("average containment") to estimate ANI. Average containment for these comparisons ranges from 0-1, and ANI estimates range from 80%-100%.
@ctb suggested binning containment values so we can develop a feel for containment --> ANI. Here I've binned containment by 0.05 intervals (containment ranges from 0-1).
Note:![image](https://user-images.githubusercontent.com/7740734/156426653-3e677f7d-371f-4f4a-9414-c685f6d14566.png)
count
is the total number of pairwise genome comparisons in each bin.csv version of this table attached. mean-containment-k21-to-ANI.csv