Open psathyrella opened 9 years ago
For those of you, such as @vnminin, who might be following along, the idea is that we could have the emission probabilities be a mixture of two cases: the two sequences are derived from a common mutant from germline, or they are not. We can just assume that the common mutant from germline has a uniform base (which explains @psathyrella's "sum of four matrices").
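A rough sketch of that mixture, to make the "sum of four matrices" concrete. Everything beyond the two-case idea is an assumption made for illustration: the mixing weight `w` is hypothetical, the per-site mutation frequency `f` is reused for descent from the common mutant, and the function names are made up (this is not partis code).

```python
BASES = "ACGT"

def single_emission(center, f):
    """P(observed base | center base): 1 - f on the center base, f/3 on each other base."""
    return {b: (1.0 - f) if b == center else f / 3.0 for b in BASES}

def outer(p, q):
    """Pair table keyed by (base in sequence 1, base in sequence 2)."""
    return {(a, b): p[a] * q[b] for a in BASES for b in BASES}

def mixture_pair_emission(germline, f, w):
    """Mix the 'no common mutant' and 'common mutant' cases (w is a hypothetical weight)."""
    # Case 1: no common mutant, so each sequence mutates independently from germline.
    e = single_emission(germline, f)
    independent = outer(e, e)
    # Case 2: a common mutant with a uniform base, i.e. an average of four matrices,
    # one per possible base of the common mutant (assumption: same f below the mutant).
    shared = {pair: 0.0 for pair in independent}
    for x in BASES:
        ex = single_emission(x, f)
        for pair, p in outer(ex, ex).items():
            shared[pair] += 0.25 * p
    return {pair: (1.0 - w) * independent[pair] + w * shared[pair]
            for pair in independent}

table = mixture_pair_emission("G", f=0.1, w=0.3)
```

With germline `G`, the entry `table[("A", "A")]` (both sequences mutated to the same base) comes out larger than `table[("A", "C")]` (mutated to different bases), which is the behavior the mixture is meant to capture. How `w` should be chosen is left open here.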
Another thing I just realized is that if we stick with independent emissions the k-hmm is pretty easy, while if we do joint emission it'd be a colossal fisterclick.
Amen to that.
On Mon, Dec 22, 2014 at 5:38 PM, Duncan Ralph notifications@github.com wrote:
Another thing I just realized is that if we stick with independent emissions the k-hmm is pretty easy, while if we do joint emission it'd be a colossal fisterclick.
Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/
Similar mutations should indicate shared ancestry, but Duncan correctly points out that clonal lineages may not have many shared mutations if the "trunk" is short.
very similar to #175
https://github.com/psathyrella/partis/commit/2bae2ea536d283da17e3538f116a9c55a2753003
may turn out to be the better way to handle this
The basic situation
We have a per-site mutation frequency, f (the fraction of observed sequences that have a mutation at the site), and we want to fill in the 4x4 table of pair emission probabilities in the HMM.

Independent emissions
The simplest way to do this is to make the emission probabilities a product of two factors (one for each sequence), where each factor is f/3 (if the sequence is not germline) and 1-f (if the sequence is germline).

Joint emissions
So all we need to implement joint emission is to fill in the entries in the matrix so that they take into account that if the two sequences are mutated to the same base, they're more likely to be clonally related. Except I haven't worked out a good way to do this. Everything I've tried requires assumptions about branch lengths and tree topology that are not always true, so empirically the results are not that great.
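To see what the independent table leaves out, here is a minimal sketch of it in Python. The function name and the dictionary layout are assumptions chosen for illustration, not how partis stores the table; the factors f/3 and 1-f are the ones described above.

```python
BASES = "ACGT"

def independent_pair_emission(germline, f):
    """4x4 pair emission table keyed by (base in sequence 1, base in sequence 2)."""
    def factor(base):
        # 1 - f if the sequence matches germline at this site, f/3 for each other base
        return (1.0 - f) if base == germline else f / 3.0
    return {(a, b): factor(a) * factor(b) for a in BASES for b in BASES}

indep = independent_pair_emission("G", f=0.1)
same_mutation = indep[("A", "A")]       # both sequences mutated G -> A
different_mutation = indep[("A", "C")]  # one mutated G -> A, the other G -> C
```

Under independence, `same_mutation` and `different_mutation` are both (f/3)^2, so a shared mutation carries no extra weight. A joint table would have to raise the same-base entries relative to the different-base ones while still summing to one.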
Erick and I talked about this a few months ago. If memory serves, we got as far as him being convinced it was non-trivial, but we didn't work out how to do it.
This is quite related to #8.