Open matsen opened 1 year ago
I implemented something like this in #73, although it's slightly different:
The Jukes-Cantor ML branch length is given by $\hat\nu = -\frac{3}{4}\ln(1-\frac{4}{3}p)$, where $p = m/N$, the number of mutated sites $m$ normalized by the total number of sites $N$.
The likelihood penalty for each site, with the parent base $i$ and the child base $j$, is
P_{ij}(\nu) = \cases{ \frac{1}{4} + \frac{3}{4} e^{-4\nu/3},& i = j\\
\frac{1}{4} - \frac{1}{4} e^{-4\nu/3},& i$\neq$ j\\}
Using the ML branch length estimate, this simplifies to
P_{ij}(\hat\nu) = \cases{ 1-p,& i = j\\
\frac{1}{3}p,& i$\neq$ j\\}
so the product of these values over all sites is the likelihood penalty for a branch.
The simplest model of DNA evolution is Jukes-Cantor: https://en.wikipedia.org/wiki/Models_of_DNA_evolution#JC69_model_(Jukes_and_Cantor_1969)
The simplest weight we could put on the edges of an hDAG is to use the likelihood of the observed sequence transitions using the ML branch length for that edge. The transition matrix for this can be calculated directly as in the wikipedia article.
Then for each edge the likelihood is the likelihood of a mutation using the corresponding branch length to the power of the number of mutations, times the likelihood of no mutation to the power of the number of sites without mutation.