Investigate differences in imputation results between `tskit.lshmm` and BEAGLE

szhan commented 1 year ago

Genotype imputation using the two methods currently yields dramatically different results, with tskit.lshmm performing far worse relative to BEAGLE than expected.

I'm using the high-coverage (>30x) individual samples (n = 876) in the unified genealogies as a case study to figure out why the imputation results are so different, even though both methods use the same underlying Li & Stephens HMM for sample matching (albeit with different implementations).

I've randomly partitioned the high-coverage samples into a mock reference panel (n = 700) and a mock target study cohort (n = 176). Also, I'm focussing only on chip-like sites (n = 7,899), i.e. sites covered by a commercially available genotyping array (see #10). This means that I'm subsetting both the reference panel and the target cohort to the genetic variation data at the chip-like sites only.
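A minimal sketch of this partitioning and subsetting step, assuming the genotypes are held in a NumPy array of shape (sites, samples). The toy matrix, the `is_chip_site` mask, and the seed are placeholders for illustration, not the actual analysis code or data:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility

n_samples, n_ref = 876, 700

# Shuffle the sample indices once, then split into two disjoint groups.
shuffled = rng.permutation(n_samples)
ref_idx = np.sort(shuffled[:n_ref])     # mock reference panel (n = 700)
target_idx = np.sort(shuffled[n_ref:])  # mock target cohort (n = 176)

# Toy stand-in for the real genotype matrix: rows are sites, columns are samples.
n_sites_total = 50
genotypes = rng.integers(0, 2, size=(n_sites_total, n_samples))

# Hypothetical boolean mask marking which sites are on the genotyping array.
is_chip_site = rng.random(n_sites_total) < 0.3

# Subset both groups to the chip-like sites only.
ref_panel = genotypes[is_chip_site][:, ref_idx]
target = genotypes[is_chip_site][:, target_idx]
```

Splitting a single permutation guarantees that no sample ends up in both the reference panel and the target cohort.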

The idea is to run tskit.lshmm to match the target samples against the reference panel under different parameters and see how the number of wrongly imputed sites varies. The results will then be compared with those obtained using BEAGLE.
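One way the number of wrongly imputed sites could be tallied is sketched below. It assumes the true and imputed alleles are available as (sites, samples) arrays and that a boolean vector marks which sites were hidden from the matcher and then imputed. The function name and its arguments are hypothetical and are not part of tskit or BEAGLE:

```python
import numpy as np

def count_imputation_errors(true_alleles, imputed_alleles, is_masked):
    """Per-sample count of masked sites where the imputed allele is wrong.

    true_alleles, imputed_alleles: arrays of shape (n_sites, n_samples).
    is_masked: boolean array of shape (n_sites,), True where a site was imputed.
    """
    true_alleles = np.asarray(true_alleles)
    imputed_alleles = np.asarray(imputed_alleles)
    is_masked = np.asarray(is_masked, dtype=bool)
    # A site counts as an error only if it was imputed and disagrees with truth.
    wrong = (true_alleles != imputed_alleles) & is_masked[:, np.newaxis]
    return wrong.sum(axis=0)

# Toy example: 4 sites, 2 samples, sites 1 and 2 were imputed.
truth = [[0, 1], [1, 1], [0, 0], [1, 0]]
imputed = [[0, 1], [0, 1], [0, 1], [1, 0]]
masked = [False, True, True, False]
errors = count_imputation_errors(truth, imputed, masked)
```

Running the same function on the output of each method, for each parameter setting, would give directly comparable error counts.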

szhan commented 1 year ago

See #26

szhan commented 1 year ago

The way that BEAGLE performs imputation is fundamentally different from what I initially thought. BEAGLE uses the forward-backward algorithm to compute genotype probabilities, which are then used to get the MAP allele at each non-genotyped site, instead of finding the Viterbi traceback path.
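To make the distinction concrete, here is a toy haploid, biallelic Li & Stephens HMM in NumPy with both decoding strategies. It is only an illustrative sketch and is neither BEAGLE's nor tskit's implementation. The constant per-site recombination probability `r`, the mismatch probability `mu`, and the `MISSING` marker are assumptions of the example:

```python
import numpy as np

MISSING = -1  # marks a non-genotyped site in the query (assumption of this sketch)

def emission(panel, query, mu):
    """Emission probabilities, shape (n_sites, n_haps)."""
    e = np.where(panel == query[:, None], 1.0 - mu, mu)
    e[query == MISSING] = 1.0  # missing sites carry no information
    return e

def posterior_decode(panel, query, r, mu):
    """Forward-backward, then the most probable allele at each site."""
    n_sites, n_haps = panel.shape
    e = emission(panel, query, mu)
    fwd = np.zeros((n_sites, n_haps))
    bwd = np.ones((n_sites, n_haps))
    fwd[0] = e[0] / n_haps
    fwd[0] /= fwd[0].sum()
    for l in range(1, n_sites):
        fwd[l] = e[l] * ((1 - r) * fwd[l - 1] + r / n_haps * fwd[l - 1].sum())
        fwd[l] /= fwd[l].sum()  # rescale to avoid underflow
    for l in range(n_sites - 2, -1, -1):
        w = bwd[l + 1] * e[l + 1]
        bwd[l] = (1 - r) * w + r / n_haps * w.sum()
        bwd[l] /= bwd[l].sum()
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)  # P(copying haplotype j at site l)
    prob_alt = (post * (panel == 1)).sum(axis=1)  # P(allele 1) per site
    return (prob_alt > 0.5).astype(int), post

def viterbi_decode(panel, query, r, mu):
    """Single best copying path, then the alleles along that path."""
    n_sites, n_haps = panel.shape
    log_e = np.log(emission(panel, query, mu))
    log_stay = np.log(1 - r + r / n_haps)
    log_switch = np.log(r / n_haps)
    score = log_e[0] - np.log(n_haps)
    back = np.zeros((n_sites, n_haps), dtype=int)
    for l in range(1, n_sites):
        best_prev = int(np.argmax(score))
        stay = score + log_stay
        switch = score[best_prev] + log_switch
        back[l] = np.where(stay >= switch, np.arange(n_haps), best_prev)
        score = log_e[l] + np.maximum(stay, switch)
    path = np.zeros(n_sites, dtype=int)
    path[-1] = int(np.argmax(score))
    for l in range(n_sites - 1, 0, -1):
        path[l - 1] = back[l, path[l]]
    return panel[np.arange(n_sites), path], path

# Toy panel: 5 sites x 2 haplotypes; the query is missing site 2.
panel = np.array([[0, 1], [0, 1], [1, 0], [0, 1], [0, 1]])
query = np.array([0, 0, MISSING, 0, 0])
map_alleles, post = posterior_decode(panel, query, r=0.01, mu=1e-3)
vit_alleles, path = viterbi_decode(panel, query, r=0.01, mu=1e-3)
```

In this toy case the two decoders agree, because the query matches the first haplotype at every genotyped site. They can disagree when the posterior mass at a missing site is spread over several haplotypes carrying different alleles: posterior decoding sums over all of them, whereas Viterbi commits to the single best path.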

szhan / onekg_analysis

Investigate differences in imputation results between `tskit.lshmm` and BEAGLE #11