Fix HMM for > 2 alleles

jeromekelleher commented 3 years ago

The emission probas in the HMM are not computed correctly when we have > 2 alleles. We currently raise a ValueError if the user tries to force inference at such sites.

See #415 and the linked issues for background.

jeromekelleher commented 2 years ago

Need this for #675

Check for tests in referring to #415 and enable.

hyanwong commented 2 years ago

I see that I raised this issue originally in https://github.com/tskit-dev/tsinfer/issues/403. I'm happy to revisit it for this PR.

hyanwong commented 2 years ago

A summary is helpful here, I think. Internally we use recombination probabilities, calculated from a recombination rate combined with a the distance between sites (e.g. here). This calculated recombination probability is passed to the internal algorithm (as rho) and then used to calculate the transition probabilities, p_recomb and p_no_recomb e.g. here:

https://github.com/tskit-dev/tsinfer/blob/129034430ce319b8848104f4a2e626ac69ba9cf1/tsinfer/algorithm.py#L736

We also specify mismatch probabilities (mu, often equated with the mutation probability in L&S HMMs). In tsinfer these are calculated by combining the mismatch ratio with the median recombination probability. We convert the mismatch probability into an emission probability e via the Rosen & Paton parameterization (their eqn 4): e = 1-(A-1) * µ if the hidden state is the same as the emitted state, or e=µ if the hidden state is different from the emitted state, here:

https://github.com/tskit-dev/tsinfer/blob/129034430ce319b8848104f4a2e626ac69ba9cf1/tsinfer/algorithm.py#L745

For A=2 alleles, the emission probs are therefore 1-mu (if hidden and emitted states are the same) and mu (if different), This has the desirable feature that if mu=0, the probs are 1 and 0, or if mu=1 the probs are 0 and 1. We decided in https://github.com/tskit-dev/tsinfer/issues/403#issuecomment-741869133 that we were happy with this definition of mismatch: i.e. if mu=1 it forces the hidden state to be different from the emitted state (this contrasts to the treatment in Donnelly & Leslie, eqn 4.2, where an infinite mutation rate results in the emitted state having an equal probability of being any of the allelic states).

The problem with this particular parameterisation is that with a high mismatch probability (mu=1) and more than 2 alleles, the emission probability becomes less than zero! So effectively, the Rosen & Paton parameterization requires mu to be less than 1/(A-1). Another way of thinking about this is that it treats the "mismatch probability" as the probability of mutation from one allele to a specific other allele, rather than to any other allele. This is the same in the Lunter parameterization, where A is hardcoded to 4, so that his mutation rate mu can never be set above 1/3.

I suggest that for L&S matching with >2 alleles we don't want to have this particular parameterization. I suggest 2 possibilities:

We could treat all non-matching alleles as identical for the purpose of matching (i.e. partition the alleles into match/non-match, rather than a set of allelic states). That means we simply have 1-mu and mu as the emission probabilities, and don't bother incorporating the number of alleles at all.
We could switch to the Donnelly & Leslie parameterization, and have mu=1 meaning that the emission probabilities are all equal (e.g. 1/num_alleles) and mu=0 meaning that the emission probabilities are 1 and 0, so something like 1/num_alleles * mu and 1/num_alleles + (1-mu) * (num_alleles - 1)/num_alleles (I think this is the equivalent to Donnelly & Leslie's eqn 4.2 but with mu going from 0..1 rather than theta going from 0..infinity).

Note that @astheeggeggs has thought about this, for example at https://github.com/astheeggeggs/lshmm/blob/9b15d8417e2856e13b28d59b4c8d078068fd8c0b/lshmm/api.py#L142 so we should ask him to look at any reparameterisation we do here to check that it makes sense.

tskit-dev / tsinfer

Fix HMM for > 2 alleles #437