Open petrelharp opened 5 years ago
Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See https://github.com/tskit-dev/tskit/issues/2037#issuecomment-989304089
Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed?
We probably should, putting into tskit 0.4.1
A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all these 3 tree sequences (in the last, each pair occurs in each tree)
Revisiting this, also the branch-length distance between two samples that are isolated from each other in the topology should (IMO) be NaN or infinity. But currently, e.g. two isolated samples have a distance of 0 between them:
empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0],[1]], mode="branch") == 0
This presumably applies to a number of other branch-length stats too.
I'm hitting this again when working with the new 1000 genome inferred tree sequences:
import tszip
ts = tszip.decompress("1kgp-chr20p-filterTrimmedWithCpg-mm0-post-processed.trees.tsz")
print(ts.diversity(), ts.trim().diversity())
Gives the following (the second number is the correct one)
0.00035674648799140446 0.0008991582285975136
The first number is over 50% smaller than it should be because the chr20p tree sequence only covers 40% of the total sequence length. This is pretty confusing, IMO. How difficult would it be to raise an error if any of the trees have no edges?
How difficult would it be to raise an error if any of the trees have no edges?
It's not immediately obvious to me how it should be done, but you're right we should be raising an error for this.
The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:
A nice thing about the current definitions is that many stats won't need modifying the numerator to account for missing data.