Open jeromekelleher opened 3 years ago
Need this for #675
Check for tests in referring to #415 and enable.
I see that I raised this issue originally in https://github.com/tskit-dev/tsinfer/issues/403. I'm happy to revisit it for this PR.
A summary is helpful here, I think. Internally we use recombination probabilities, calculated from a recombination rate combined with a the distance between sites (e.g. here). This calculated recombination probability is passed to the internal algorithm (as rho
) and then used to calculate the transition probabilities, p_recomb
and p_no_recomb
e.g. here:
We also specify mismatch probabilities (mu
, often equated with the mutation probability in L&S HMMs). In tsinfer
these are calculated by combining the mismatch ratio with the median recombination probability. We convert the mismatch probability into an emission probability e
via the Rosen & Paton parameterization (their eqn 4): e = 1-(A-1) * µ
if the hidden state is the same as the emitted state, or e=µ
if the hidden state is different from the emitted state, here:
For A=2 alleles, the emission probs are therefore 1-mu
(if hidden and emitted states are the same) and mu
(if different), This has the desirable feature that if mu=0
, the probs are 1 and 0, or if mu=1
the probs are 0 and 1. We decided in https://github.com/tskit-dev/tsinfer/issues/403#issuecomment-741869133 that we were happy with this definition of mismatch: i.e. if mu=1
it forces the hidden state to be different from the emitted state (this contrasts to the treatment in Donnelly & Leslie, eqn 4.2, where an infinite mutation rate results in the emitted state having an equal probability of being any of the allelic states).
The problem with this particular parameterisation is that with a high mismatch probability (mu=1
) and more than 2 alleles, the emission probability becomes less than zero! So effectively, the Rosen & Paton parameterization requires mu
to be less than 1/(A-1)
. Another way of thinking about this is that it treats the "mismatch probability" as the probability of mutation from one allele to a specific other allele, rather than to any other allele. This is the same in the Lunter parameterization, where A is hardcoded to 4, so that his mutation rate mu
can never be set above 1/3.
I suggest that for L&S matching with >2 alleles we don't want to have this particular parameterization. I suggest 2 possibilities:
1-mu
and mu
as the emission probabilities, and don't bother incorporating the number of alleles at all.mu=1
meaning that the emission probabilities are all equal (e.g. 1/num_alleles
) and mu=0
meaning that the emission probabilities are 1 and 0, so something like
1/num_alleles * mu
and 1/num_alleles + (1-mu) * (num_alleles - 1)/num_alleles
(I think this is the equivalent to Donnelly & Leslie's eqn 4.2 but with mu going from 0..1 rather than theta going from 0..infinity). Note that @astheeggeggs has thought about this, for example at https://github.com/astheeggeggs/lshmm/blob/9b15d8417e2856e13b28d59b4c8d078068fd8c0b/lshmm/api.py#L142 so we should ask him to look at any reparameterisation we do here to check that it makes sense.
The emission probas in the HMM are not computed correctly when we have > 2 alleles. We currently raise a ValueError if the user tries to force inference at such sites.
See #415 and the linked issues for background.