Allowing fuzzy probabilities for ancestral state assignment, using an HMM

@nspope had a great idea for a tsinfer-like method to identify mispolarised ancestral states. This could form the basis of a comprehensive recombination-aware ancestral state polariser. Note that existing software such as EST-SFS is not really recombination-aware, so this could be aiming at an open goal.

Here's my summary:

We assign probabilities rather than binary 0/1 to the ancestral states
As in tsinfer, we create ancestral haplotypes on the basis of focal sites. Here, we create two ancestral haplotypes (say A and B) for each focal site, one for each polarisation, and assign each of them a probability. For example, if a biallelic site has 3 samples with a T and 7 with a G, we create one haplotype with probability 0.3 and derived state T, and another with probability 0.7 and derived state G. We might want to normalise the probabilities so that the most likely one has "probability" 1.
When we build the haplotypes, we take account of the probabilities at adjacent sites using an HMM / dynamic programming, which integrates over the probabilities at each adjacent site. We stop extending a haplotype once the likelihoods have become sufficiently small.
We match haplotypes as normal using tsinfer, but account for the fuzzy probabilities rather than using binary 0/1 membership at each site
We create a composite score in which every haplotype that matches to A or B is required to match either entirely to A or entirely to B. The polarisation with the highest score is taken to be the correct one.
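To make the second step concrete, here is a minimal sketch of the weights for the two polarisations of a single biallelic site, using the 3 T / 7 G example. The function name `polarisation_weights` and the `normalise` flag are made up for illustration and are not part of tsinfer; the only assumption is that the weight attached to a given derived state is simply that allele's sample frequency.

```python
from collections import Counter


def polarisation_weights(alleles, normalise=True):
    """Return {derived_allele: weight} for one biallelic site.

    Hypothetical helper, not tsinfer API. The weight for "allele X is
    derived" is taken to be the sample frequency of X.
    """
    counts = Counter(alleles)
    if len(counts) != 2:
        raise ValueError("expected a biallelic site")
    n = sum(counts.values())
    weights = {allele: count / n for allele, count in counts.items()}
    if normalise:
        # Rescale so that the most likely polarisation has "probability" 1.
        top = max(weights.values())
        weights = {allele: w / top for allele, w in weights.items()}
    return weights


site = ["T"] * 3 + ["G"] * 7
raw = polarisation_weights(site, normalise=False)  # T: 0.3, G: 0.7
scaled = polarisation_weights(site)                # G: 1.0, T: 3/7 (about 0.43)
```

After normalisation these are no longer probabilities in the strict sense, which is why "probability" is in quotes above.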

This essentially picks the polarisations that minimise the number of recombinations required. Note that because we are storing probabilities of the two states for each ancestral haplotype at each site, this would not be as scalable as tsinfer, so we would need to run it on a subset of the data.
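As a rough illustration of matching against fuzzy rather than binary states, here is a sketch of a Li & Stephens-style forward pass in which each panel entry is the probability that the haplotype carries allele 1 at that site. This is not tsinfer's matching engine: the function name, the uniform switch model, and the values of `rho` (per-site recombination probability) and `mu` (mismatch probability) are all assumptions chosen for the example.

```python
import numpy as np


def fuzzy_forward_loglik(query, panel, rho=0.01, mu=0.001):
    """Log-likelihood of a 0/1 query haplotype under a fuzzy copying HMM.

    panel[k, m] is the probability that panel haplotype k carries allele 1
    at site m. rho and mu are placeholder values, not estimates.
    """
    query = np.asarray(query)
    panel = np.asarray(panel, dtype=float)
    k, m = panel.shape
    forward = np.full(k, 1.0 / k)  # uniform prior over copying sources
    loglik = 0.0
    for site in range(m):
        if site > 0:
            # forward sums to 1 here, so a uniform switch adds rho / k.
            forward = (1 - rho) * forward + rho / k
        # Probability that each panel haplotype agrees with the query allele.
        agree = np.where(query[site] == 1, panel[:, site], 1 - panel[:, site])
        emission = (1 - mu) * agree + mu * (1 - agree)
        forward = forward * emission
        scale = forward.sum()
        loglik += np.log(scale)
        forward = forward / scale  # rescale to avoid underflow
    return loglik


query = [1, 1, 0]
crisp = [[1, 1, 0], [0, 0, 1]]        # middle site polarised consistently
fuzzy = [[1, 0.5, 0], [0, 0.5, 1]]    # middle site left undecided
flipped = [[1, 0, 0], [0, 1, 1]]      # middle site polarised the other way

ll_crisp = fuzzy_forward_loglik(query, crisp)
ll_fuzzy = fuzzy_forward_loglik(query, fuzzy)
ll_flipped = fuzzy_forward_loglik(query, flipped)
```

In this toy panel the flipped polarisation can only explain the query with a mismatch or two switches, so it gets the lowest likelihood, which is the "fewest recombinations" intuition in miniature. Note that the panel is a dense float array of shape (haplotypes, sites), which is exactly the storage cost mentioned above.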

@jeromekelleher points out that real (e.g. UKBB) inferences have a huge number of sites (and also/therefore a huge number of ancestors), even for relatively small sample sizes. So running an HMM which stores floating point probabilities along the entire genome for each ancestor might be prohibitive.

However, the information used to detect errors / bad ancestral alleles is fairly local to the region being inferred, so it might be possible to detect this somehow using a moving window along the genome? We would need to think fairly carefully about this.
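One way the moving-window idea might look in practice is sketched below: iterate over overlapping windows of site indexes so that only one window's worth of per-ancestor probabilities is held in memory at a time. Everything here is hypothetical (the helper names, the window and step sizes, and the choice of 8-byte floats), and it says nothing about how results from overlapping windows would be reconciled, which is the part that needs careful thought.

```python
def sliding_windows(num_sites, window, step):
    """Yield (start, stop) site-index ranges covering 0..num_sites.

    Hypothetical helper. step <= window guarantees that no site is skipped.
    """
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("need 0 < step <= window")
    start = 0
    while start < num_sites:
        stop = min(start + window, num_sites)
        yield start, stop
        if stop == num_sites:
            break
        start += step


def window_bytes(num_ancestors, window, itemsize=8):
    """Bytes needed for one float per ancestor per site within a window."""
    return num_ancestors * window * itemsize


windows = list(sliding_windows(num_sites=10, window=4, step=2))
# windows == [(0, 4), (2, 6), (4, 8), (6, 10)]
per_window = window_bytes(num_ancestors=1000, window=4)  # 32000 bytes
```

The memory per window then depends on the window width rather than on the total number of sites, although it still scales with the number of ancestors overlapping the window.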

It might also be possible to think of a way to use this in a "normal" Li & Stephens-like algorithm, in which each haplotype is successively matched against a panel of all the other haplotypes. It's unclear to me how the ancestral state would come into this, though.
