What imputation metrics to use

hyanwong commented 1 year ago

We should add a notebook that uses the unified genealogy to create genotypes, reads those into sgkit, subsets to random samples and sites, adds a random call mask, runs tsinfer, and outputs some imputation metrics.

This isn't ideal, because the unified genealogy already contains a number of imputed calls, which in this case we would be treating as ground truth. Still, it's a start at addressing the question of which imputation metrics are best for testing the quality of inference. Hopefully pretty much all the metrics will show the same pattern, but we'll have to see.
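As a rough illustration of the masking and scoring steps in the proposed notebook, here is a minimal NumPy-only sketch. Everything in it is an assumption for illustration: the genotypes are randomly simulated rather than taken from the unified genealogy, `impute_major_allele` is a hypothetical placeholder standing in for the tsinfer step, and the function names are not from sgkit or tsinfer.

```python
import numpy as np


def add_random_call_mask(genotypes, missing_fraction, rng):
    """Hide a random fraction of calls; return (mask, masked copy with -1 for missing)."""
    mask = rng.random(genotypes.shape) < missing_fraction
    masked = genotypes.copy()
    masked[mask] = -1
    return mask, masked


def impute_major_allele(masked):
    """Hypothetical placeholder imputer (NOT tsinfer): fill each missing call
    with the most common observed allele at that site."""
    imputed = masked.copy()
    for site in range(masked.shape[0]):
        row = masked[site]
        observed = row[row >= 0]
        fill = np.bincount(observed, minlength=2).argmax() if observed.size else 0
        imputed[site, row < 0] = fill
    return imputed


def concordance(truth, imputed, mask):
    """Fraction of masked calls that were imputed correctly."""
    return float(np.mean(truth[mask] == imputed[mask]))


def squared_correlation(truth, imputed, mask):
    """Squared Pearson correlation between true and imputed alleles at masked calls."""
    t = truth[mask].astype(float)
    i = imputed[mask].astype(float)
    if t.std() == 0 or i.std() == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return float(np.corrcoef(t, i)[0, 1] ** 2)


rng = np.random.default_rng(42)
# Simulated stand-in for real data: 200 biallelic sites x 100 haplotypes.
freqs = rng.uniform(0.05, 0.5, size=(200, 1))
truth = (rng.random((200, 100)) < freqs).astype(np.int8)

mask, masked = add_random_call_mask(truth, missing_fraction=0.1, rng=rng)
imputed = impute_major_allele(masked)

print("concordance:", concordance(truth, imputed, mask))
print("r^2:", squared_correlation(truth, imputed, mask))
```

Only the masked calls are scored, since the unmasked calls are copied through unchanged and would inflate any metric.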

hyanwong commented 1 year ago

Shing says: "Castedo suggested using info-theoretic measures (e.g. mutual information) to assess imputation performance."
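For reference, a minimal sketch of what a mutual information metric could look like, computed between the true and imputed alleles at the masked calls. The function name and the choice of bits (log base 2) are assumptions for illustration, not anything specified in this thread.

```python
import numpy as np


def mutual_information(truth, imputed, n_alleles=2):
    """Mutual information (in bits) between true and imputed allele calls.

    Both inputs are 1-D integer arrays of equal length with values in
    range(n_alleles), e.g. the calls at the masked positions only.
    """
    truth = np.asarray(truth)
    imputed = np.asarray(imputed)
    # Joint distribution of (true allele, imputed allele).
    joint = np.zeros((n_alleles, n_alleles))
    np.add.at(joint, (truth, imputed), 1)
    joint /= joint.sum()
    # Product of the marginals, i.e. the joint expected under independence.
    expected = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0  # skip empty cells, which contribute 0 by convention
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / expected[nonzero])))


truth = np.array([0, 0, 1, 1])
print(mutual_information(truth, truth))                   # perfect imputation
print(mutual_information(truth, np.array([0, 0, 0, 0])))  # always predicts allele 0
```

One reason this might behave differently from simple concordance: an imputer that always predicts the major allele can score high concordance at low-frequency sites, but its mutual information with the truth is zero.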

hyanwong commented 1 year ago

See https://github.com/hyanwong/100kG-testing/blob/main/notebooks/MismatchTesting.ipynb for some code that can be used to test this sort of thing.
