tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0

Allowing fuzzy probabilities for ancestral state assignment, using an HMM #863

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

@nspope had a great idea for a tsinfer-like method to identify mispolarised ancestral states. This could form the basis of a comprehensive recombination-aware ancestral state polariser. Note that existing software such as EST-SFS is not really recombination-aware, so this could be aiming at an open goal.

Here's my summary:

This essentially picks the polarisations that minimise the number of recombinations required. Note that because we are storing probabilities of the two states for each ancestral haplotype at each site, this would not be as scalable as tsinfer, so we would need to run it on a subset of the data.

hyanwong commented 1 year ago

@jeromekelleher points out that real (e.g. UKBB) inferences have a huge number of sites (and therefore also a huge number of ancestors), even for relatively small sample sizes. So running an HMM which stores floating-point probabilities along the entire genome for each ancestor might be prohibitively expensive.
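To put rough numbers on that concern, here is a back-of-the-envelope sketch. The function name and the ancestor/site counts are made up for illustration (they are not taken from any real UKBB inference), and it assumes every ancestor spans every site, so it is an upper bound rather than an estimate of what an actual implementation would need.

```python
def hmm_state_memory_bytes(num_ancestors, num_sites, bytes_per_value=8, num_states=2):
    """Bytes needed to hold state probabilities for every ancestor at every site.

    Hypothetical helper: assumes a dense array with no compression, and that
    each ancestor covers the whole genome (an upper bound).
    """
    # With two states the probabilities sum to 1, so one value per site is enough.
    values_per_site = num_states - 1
    return num_ancestors * num_sites * values_per_site * bytes_per_value


# Illustrative sizes only, chosen to show the scaling.
dense = hmm_state_memory_bytes(num_ancestors=1_000_000, num_sites=1_000_000)
print(f"{dense / 1e12:.1f} TB")  # 8.0 TB
```

Halving the precision (`bytes_per_value=4`) only halves the total, so the product of ancestors and sites is what dominates here.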

However, the information used to detect errors / bad ancestral alleles is fairly local to the region being inferred, so it might be possible to detect them using a moving window along the genome. We would need to think fairly carefully about this.
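One way the moving-window idea could be laid out, purely as a sketch: run the HMM on overlapping chunks of sites, but only keep the calls from the middle ("core") of each chunk, so every site is decided with some flanking context. The function name and the `window` / `flank` parameters are hypothetical, and how large the flank would need to be is exactly the part that needs careful thought.

```python
def sliding_windows(num_sites, window, flank):
    """Yield (start, stop, core_start, core_stop) site-index ranges.

    The HMM would be run over [start, stop), but only results for sites in
    the core [core_start, core_stop) would be kept. Cores tile the genome
    without overlap; flanks provide up to `flank` sites of context per side.
    """
    if window <= 0 or flank < 0:
        raise ValueError("window must be positive and flank non-negative")
    for core_start in range(0, num_sites, window):
        core_stop = min(core_start + window, num_sites)
        start = max(core_start - flank, 0)
        stop = min(core_stop + flank, num_sites)
        yield start, stop, core_start, core_stop


for w in sliding_windows(num_sites=10, window=4, flank=2):
    print(w)
```

Memory then scales with `window + 2 * flank` sites rather than with the whole genome, at the cost of recomputing the flanking sites for each window.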

It might also be possible to think of a way to use this in a "normal" Li & Stephens-like algorithm, in which each haplotype is successively matched against a panel of all the other haplotypes. It's unclear to me how the ancestral state would come into this, though.
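For reference, the copying step in a Li & Stephens-like model can be sketched as a small Viterbi pass that trades off switches (recombinations) against mismatches. This is a simplified, cost-based toy and not tsinfer's matching code: the function name and the `switch_cost` / `mismatch_cost` parameters are assumptions, and it deliberately says nothing about ancestral state, which is the open question above.

```python
def min_cost_copying_path(panel, query, switch_cost=1.0, mismatch_cost=2.0):
    """Cheapest way to copy `query` from the rows of `panel` (toy Viterbi).

    panel: list of equal-length 0/1 haplotypes; query: one 0/1 haplotype.
    Returns (path, num_switches, num_mismatches), where path[i] is the index
    of the panel haplotype copied at site i.
    """
    n, m = len(panel), len(query)
    if n == 0 or m == 0:
        raise ValueError("panel and query must be non-empty")
    # Cost of the best path ending in each panel haplotype at the current site.
    cost = [mismatch_cost * (panel[j][0] != query[0]) for j in range(n)]
    back = []  # back[i][j]: best previous haplotype when copying j at site i + 1
    for site in range(1, m):
        best_prev = min(range(n), key=cost.__getitem__)
        new_cost, pointers = [], []
        for j in range(n):
            stay, switch = cost[j], cost[best_prev] + switch_cost
            prev, c = (j, stay) if stay <= switch else (best_prev, switch)
            new_cost.append(c + mismatch_cost * (panel[j][site] != query[site]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final state.
    state = min(range(n), key=cost.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    switches = sum(a != b for a, b in zip(path, path[1:]))
    mismatches = sum(panel[j][s] != query[s] for s, j in enumerate(path))
    return path, switches, mismatches


panel = [[0, 0, 0, 0], [1, 1, 1, 1]]
print(min_cost_copying_path(panel, [0, 0, 1, 1]))  # ([0, 0, 1, 1], 1, 0)
```

In this toy, an isolated discordant allele can be explained either by two switches or by one mismatch, depending on the relative costs. Any scheme for flagging bad ancestral alleles in this setting would presumably have to build on that trade-off.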