Deal with missing data in stats

tskit-dev / tskit

Population-scale genomics

MIT License

155 stars 73 forks source link

Deal with missing data in stats #287

Open petrelharp opened 5 years ago

petrelharp commented 5 years ago

The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:

Site stats: isolated samples will be assumed to have the ancestral allele, wrongly.
All stats: the denominators should count up the number of nonmissings.

A nice thing about the current definitions is that many stats won't need modifying the numerator to account for missing data.

hyanwong commented 2 years ago

Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See https://github.com/tskit-dev/tskit/issues/2037#issuecomment-989304089

hyanwong commented 2 years ago

Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed?

jeromekelleher commented 2 years ago

We probably should, putting into tskit 0.4.1

hyanwong commented 1 year ago

A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all these 3 tree sequences (in the last, each pair occurs in each tree)

hyanwong commented 1 year ago

Revisiting this, also the branch-length distance between two samples that are isolated from each other in the topology should (IMO) be NaN or infinity. But currently, e.g. two isolated samples have a distance of 0 between them:

empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0],[1]], mode="branch") == 0

This presumably applies to a number of other branch-length stats too.

hyanwong commented 4 months ago

I'm hitting this again when working with the new 1000 genome inferred tree sequences:

import tszip
ts = tszip.decompress("1kgp-chr20p-filterTrimmedWithCpg-mm0-post-processed.trees.tsz")
print(ts.diversity(), ts.trim().diversity())

Gives the following (the second number is the correct one)

0.00035674648799140446 0.0008991582285975136

The first number is over 50% smaller than it should be because the chr20p tree sequence only covers 40% of the total sequence length. This is pretty confusing, IMO. How difficult would it be to raise an error if any of the trees have no edges?

jeromekelleher commented 4 months ago

How difficult would it be to raise an error if any of the trees have no edges?

It's not immediately obvious to me how it should be done, but you're right we should be raising an error for this.