hyanwong opened this issue 5 months ago
> another consideration is that a polytomy implies a greater total edge span than the original binary topology, which I'd think would introduce bias. IIRC, we don't see a relationship between arity and bias, however.
Now that we have a decent routine to create polytomies (https://github.com/tskit-dev/tskit/discussions/2926), we can test the effect of polytomies on dating. Here's an example using the true topologies, without and with induced polytomies (edges carrying no mutations are removed to create a polytomy; see the trees below the plot, and the sketch that follows them). It appears that inducing polytomies like this biases the mutation dates to younger times as the sample size increases:
Example first tree (original, then with induced polytomies):
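For concreteness, here's a rough sketch of how such edge collapsing might be implemented (a hypothetical re-implementation for illustration, not the routine from the linked discussion; it keeps sample nodes and roots intact):

```python
import numpy as np
import tskit


def induced_polytomy_parents(tree):
    """
    For one tskit.Tree, return a parent array in which every internal,
    non-sample, non-root node whose branch carries no mutations has been
    contracted into its parent, creating polytomies. Illustrative only.
    """
    ts = tree.tree_sequence
    # has_mut[v] is True if the branch above node v carries a mutation
    has_mut = np.zeros(ts.num_nodes, dtype=bool)
    for site in tree.sites():
        for mut in site.mutations:
            has_mut[mut.node] = True
    parent = np.full(ts.num_nodes, tskit.NULL, dtype=np.int64)
    for u in tree.nodes():
        p = tree.parent(u)
        # climb past nodes that would be contracted away
        while (
            p != tskit.NULL
            and tree.parent(p) != tskit.NULL  # never contract the root
            and not has_mut[p]                # mutation-free branch above p
            and not ts.node(p).is_sample()    # keep sample nodes intact
        ):
            p = tree.parent(p)
        parent[u] = p
    return parent
```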
FWIW, the pattern doesn't change much if we use the metadata-stored mutation times instead.
Nice!
This is great @hyanwong, thanks. I can think of a few things to try that might reduce the bias; I'll report back.
Thanks @nspope: from a few tests, it appears that the bias is less pronounced in tsinfer-inferred tree sequences. Plots below; the right-hand column is tsinfer on the same data:
As an aside, I wondered whether reducing to the topology present at each variable site would change the bias, but it doesn't seem to make much difference.
Looking first at node ages ... the reason there's bias in dating nodes after introducing polytomies is that there's more mutational area than in the original binary trees. That is, we're increasing the total branch length, which means that when we match moments using segregating sites, we end up shrinking the timescale.
To be a bit more precise: the current normalisation strategy calculates the total edge area and the total number of mutations, then rescales time so that the expected number of mutations matches the observed total.
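In sketch form (a simplified illustration, not tsdate's actual code):

```python
import numpy as np


def segregating_sites_scale(ts, mutation_rate):
    # total mutational area = sum over edges of genomic span * branch length
    times = ts.nodes_time
    edges = ts.tables.edges
    area = np.sum(
        (edges.right - edges.left) * (times[edges.parent] - times[edges.child])
    )
    # rescale time so the expected number of mutations (rate * area)
    # matches the observed count; node times get multiplied by this factor
    return ts.num_mutations / (mutation_rate * area)
```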
Instead, consider doing the following: for each tree, sample a path from a randomly selected leaf to the root, and only accumulate edge area and mutations on the sampled paths. This should be unbiased, because the path length is the same regardless of the presence of polytomies. In fact, we can do this sampling deterministically, because the probability that a randomly selected path passes through a given edge is proportional to the number of samples subtended by that edge. That is, we normalise as before, but weight edges by the number of samples they subtend.
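Here's a sketch of that weighted normalisation (hypothetical code for illustration, not what landed in tsdate): each edge, and each mutation on it, gets weight equal to the fraction of samples below it, which is exactly the probability that a random leaf-to-root path passes through it.

```python
import tskit


def path_weighted_scale(ts, mutation_rate):
    n = ts.num_samples
    times = ts.nodes_time
    weighted_area = 0.0
    weighted_muts = 0.0
    for tree in ts.trees():
        for u in tree.nodes():
            p = tree.parent(u)
            if p == tskit.NULL:
                continue
            # probability that a random leaf-to-root path uses this branch
            w = tree.num_samples(u) / n
            weighted_area += w * tree.span * (times[p] - times[u])
        for site in tree.sites():
            for mut in site.mutations:
                weighted_muts += tree.num_samples(mut.node) / n
    # rescale time so the expected mutational distance from a random sample
    # to the root matches the weighted observed number of mutations
    return weighted_muts / (mutation_rate * weighted_area)
```

The returned factor would multiply node times, just as in the segregating-sites version above.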
Using this alternative "path normalisation" strategy seems to greatly reduce the bias (1000 samples, 10 Mb):
This more-or-less carries over for mutations:
Oh wow. This is amazing. Thanks Nate.
Does it cause any overcorrection problems for tsinfer-inferred tree sequences? I assume it shouldn't...
Another way to phrase this is that we're moment matching against a different summary statistic (rather than segregating sites): the expected number of differences between a single sample and the root. In my opinion, this choice of summary statistic is a conceptually more straightforward way to measure time with mutational density.
I did a quick check on inferred simulated tree sequences: the original routine was more or less unbiased (as Yan observed above) and the new routine does about the same. It would be interesting to compare the two on real data. Regardless, this new routine seems like the right approach.
> the new routine does about the same
That's great.
> Regardless, this new routine seems like the right approach.
Absolutely, we should go with the new approach. I wonder how both approaches perform on reinference? I can check this once there are instructions for how to run the new version.
The API is exactly the same, with the new normalisation scheme used by default. The old normalisation scheme can be toggled by passing `match_segregating_sites=True` to `date`.
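So usage would look something like this (the `mutation_rate` value is a placeholder, and other arguments, e.g. a population size, may be required depending on the tsdate version and method):

```python
import tsdate

# new path normalisation is the default
dated = tsdate.date(ts, mutation_rate=1e-8)

# old behaviour: moment match against total segregating sites
dated_old = tsdate.date(ts, mutation_rate=1e-8, match_segregating_sites=True)
```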
Great, thanks for the info. Is it currently much slower than the old version? It seems maybe not?
It shouldn't be, but it would be good to check (if you enable logging, it'll print out the time spent during normalisation). There's an additional pass over edges, but this is done in numba. It might add a few minutes on GEL- or UKBB-sized data, so it would be good to enable logging there to get a sense of the overhead.
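For example (assuming, as the comment above suggests, that tsdate reports this timing through the standard logging module):

```python
import logging

# surface tsdate's INFO-level messages, including time spent in normalisation
logging.basicConfig(level=logging.INFO)
```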
> I wonder how both approaches perform on reinference
Actually, I don't think it'll change reinference at all: ancestor building just uses the ordering of mutations, right? Normalisation won't change the order, just the inter-node time differences.
Hannes had an interesting idea: how do polytomies affect the variation in posterior times for a node? We could test this by taking a known topology, collapsing some of the nodes into polytomies, dating, and then comparing the posterior time distributions of the component nodes with the posterior estimated for the collapsed polytomy. Something like the outline below.
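A rough outline of that experiment (the `collapse` helper and the `return_posteriors` flag are assumptions for illustration, not confirmed API):

```python
import tsdate


def posterior_spread_experiment(ts_binary, collapse, mutation_rate=1e-8):
    # `collapse` is a user-supplied function that merges chosen binary nodes
    # into polytomies (e.g. the routine linked earlier in this thread)
    ts_poly = collapse(ts_binary)
    # return_posteriors is an assumption about the tsdate API; other required
    # arguments (e.g. a population size prior) are omitted here
    _, post_binary = tsdate.date(
        ts_binary, mutation_rate=mutation_rate, return_posteriors=True
    )
    _, post_poly = tsdate.date(
        ts_poly, mutation_rate=mutation_rate, return_posteriors=True
    )
    # compare the spread of posterior times across the component binary nodes
    # with the posterior for the single collapsed node they map to
    return post_binary, post_poly
```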