sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

implement pairwise evolutionary distance/ANI output from sourmash #1242

Closed ctb closed 2 years ago

ctb commented 3 years ago

what this means in practical concerns is not clear, but...

cc @bluegenes

ctb commented 3 years ago

after talking with @bluegenes - it would be particularly great to have it in compare, because (a) it makes more sense than what we have now, and (b) it should allow us to compare matrices from different ksizes.

it doesn't really make sense in gather or containment, but might make sense in search.

bluegenes commented 3 years ago

Ref Criscuolo 2020, which builds on the distance originally proposed in the Mash (Ondov 2016) paper. image

"Parameters b1 and b2 can be defined according to explicit models to estimate the number d of nucleotide substitutions per character that have occurred during the evolution of the sequences x and y, e.g. 15–24. When b1 = b2 = 1, Equation (2) corresponds to the Poisson correction (PC; e.g. 21) distance. Although it is based on a simplistic model of nucleotide substitution1,16,25,26, PC is the p-distance transformation implemented in many MH tools (e.g. Mash, Dashing, FastANI, Kmer-db, BinDash). Among these models, equal-input (EI, sometimes called F8118,19,24,27–29) takes into account the equilibrium frequency πr of each residue r in Σ = {A, C, G, T}. An EI distance can be estimated using Equation (2) with b1=1−∑r∈Σπ2r and b2=1−∑r∈Σπrxπry, where πrx and πry are the frequencies of r in the two sequences x and y, respectively20. Further assuming that the heterogeneous replacement rates among nucleotide pairs and sites can be modelled with a Γ distribution, an EI distance d can be derived from p using the following formula:"

image

"where a > 0 is the (unknown) shape parameter of the Γ distribution, e.g. 22,24,30–33. It is worth noticing that when a is high, Equation (2) and Equation (3) yield very similar distance estimates (for any fixed b1 and b2)."

Other relevant references: Criscuolo 2019 (JolyTree) and Rohling 2020

Here's some code to convert a jaccard matrix from compare to pairwise evolutionary distance matrix (I haven't yet implemented the F81-like correction (3rd eqn) - will update with that soon!

def jaccard_to_evoldist(jaccard, ksize, b1=1.0, b2=1.0):
    # proportion of observed differences (jaccard of 1.0 returns 0; jaccard 0 can't be calculated)
    if jaccard ==0:
        return 1.0 # or just drop these out of the comparison?
    p = 1 - np.power(2*jaccard/(jaccard + 1),(1/float(ksize)))
    # corrected evolutionary distance
    d = -(b1*np.log((1-p)/b2))
    return d

def main(args):
    jaccard_dists = np.load(open(args.jaccard_np, 'rb'))
    jaccard_to_evoldist_vectorized = np.vectorize(jaccard_to_evoldist) # necessary?
    evoldists = jaccard_to_evoldist_vectorized(jaccard_dists, args.ksize)
    with open(args.evoldist_np, 'wb') as fp:
        np.save(fp, evoldists)

currently at https://github.com/bluegenes/2020-dist-compare/blob/main/jaccard-evoldist.py.