matsengrp / historydag

https://matsengrp.github.io/historydag
GNU General Public License v3.0
0 stars 1 forks source link

weighting edges of the DAG using Jukes-Cantor #70

Open matsen opened 1 year ago

matsen commented 1 year ago

The simplest model of DNA evolution is Jukes-Cantor: https://en.wikipedia.org/wiki/Models_of_DNA_evolution#JC69_model_(Jukes_and_Cantor_1969)

The simplest weight we could put on the edges of an hDAG is to use the likelihood of the observed sequence transitions using the ML branch length for that edge. The transition matrix for this can be calculated directly as in the wikipedia article.

Then for each edge the likelihood is the likelihood of a mutation using the corresponding branch length to the power of the number of mutations, times the likelihood of no mutation to the power of the number of sites without mutation.

willdumm commented 1 year ago

I implemented something like this in #73, although it's slightly different:

The Jukes-Cantor ML branch length is given by $\hat\nu = -\frac{3}{4}\ln(1-\frac{4}{3}p)$, where $p = m/N$, the number of mutated sites $m$ normalized by the total number of sites $N$.

The likelihood penalty for each site, with the parent base $i$ and the child base $j$, is

P_{ij}(\nu) = \cases{ \frac{1}{4} + \frac{3}{4} e^{-4\nu/3},& i = j\\
\frac{1}{4} - \frac{1}{4} e^{-4\nu/3},& i$\neq$ j\\}

Using the ML branch length estimate, this simplifies to

P_{ij}(\hat\nu) = \cases{ 1-p,& i = j\\
\frac{1}{3}p,& i$\neq$ j\\}

so the product of these values over all sites is the likelihood penalty for a branch.