Open hyanwong opened 1 year ago
And here are the correlations between the known lengths of root nodes and what we infer (it's a pretty poor correlation, though!)
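import numpy as np
import matplotlib.pyplot as plt

# root_breaks holds the known root breakpoints; r2 and r3 hold inferred breakpoints
rb = np.array(root_breaks)
# Midpoint of each known root interval, used to find the interval that contains it
mid_root_pos = rb[:-1] + np.diff(rb)/2
ss = np.searchsorted(rb, mid_root_pos)
# Known lengths plotted against themselves (identity line, for reference)
plt.scatter(np.diff(root_breaks), rb[ss] - rb[ss-1])

# Inferred lengths with the ultimate ancestor split
rb = np.array(r2)
ss = np.searchsorted(rb, mid_root_pos)
plt.scatter(np.diff(root_breaks), rb[ss] - rb[ss-1], alpha=0.1)
print(
    "corr coeff: known root lengths vs lengths with split ultimate:\n ",
    np.corrcoef(np.diff(root_breaks), rb[ss] - rb[ss-1])[0, 1])

# Inferred lengths with the extra root splitting
rb = np.array(r3)
ss = np.searchsorted(rb, mid_root_pos)
plt.scatter(np.diff(root_breaks), rb[ss] - rb[ss-1], alpha=0.1)
print(
    "corr coeff: known root lengths vs lengths with extra split root:\n ",
    np.corrcoef(np.diff(root_breaks), rb[ss] - rb[ss-1])[0, 1])
plt.xscale('log')
plt.yscale('log')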
corr coeff: known root lengths vs lengths with split ultimate:
0.06516027384592456
corr coeff: known root lengths vs lengths with extra split root:
0.13137918764309806
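The comparison above relies on `np.searchsorted` to find, for the midpoint of each known root interval, the inferred interval that contains it. Here is a minimal sketch with made-up breakpoints that may make the indexing easier to follow. The arrays `known_breaks` and `inferred_breaks` and the helper `containing_interval_lengths` are hypothetical names for illustration, not part of the code in this thread.

```python
import numpy as np


def containing_interval_lengths(breaks, positions):
    """Length of the interval in `breaks` that contains each position.

    Assumes `breaks` is sorted and that every position lies strictly
    between breaks[0] and breaks[-1].
    """
    breaks = np.asarray(breaks, dtype=float)
    idx = np.searchsorted(breaks, positions)  # index of the right-hand break
    return breaks[idx] - breaks[idx - 1]


# Toy data (illustrative only)
known_breaks = np.array([0, 10, 30, 100])
inferred_breaks = np.array([0, 25, 100])

known_lengths = np.diff(known_breaks)
midpoints = known_breaks[:-1] + known_lengths / 2
inferred_lengths = containing_interval_lengths(inferred_breaks, midpoints)
print(known_lengths, inferred_lengths)
```

In this toy case the first two known intervals both fall inside the same inferred interval, so they are paired with the same inferred length.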
Extra splitting of the root certainly improves the n=10 plot from @a-ignatieva's preprint, especially when combined with @nspope's variational gamma method:
And here for 100 samples. Since these use exactly the same topology, the improvement can't be anything to do with e.g. better polytomy breaking.
@jeromekelleher and I decided this should be implemented at a minimum for post_process, and then probably rolled out as the default. However, it would be good to think of a more efficient method than the one coded above, and also a method that keeps the nodes in time order (this might have to be done with a sort at the end, though).
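One way the "sort at the end" could look is a stable sort on node times followed by remapping the edge endpoints. The sketch below works on plain NumPy arrays rather than tskit tables; the function name `sort_nodes_by_time` and the oldest-first ordering are assumptions made for illustration.

```python
import numpy as np


def sort_nodes_by_time(node_time, edge_parent, edge_child):
    """Reorder nodes oldest-first and remap edge endpoints to the new IDs."""
    node_time = np.asarray(node_time, dtype=float)
    # Stable sort on negated time: oldest first, ties keep their original order
    order = np.argsort(-node_time, kind="stable")
    # new_id[old] is the ID that old node `old` receives after sorting
    new_id = np.empty_like(order)
    new_id[order] = np.arange(len(order))
    return (
        node_time[order],
        new_id[np.asarray(edge_parent)],
        new_id[np.asarray(edge_child)],
    )


# Toy example: node 4 (time 4) was appended out of order after a split
times, parents, children = sort_nodes_by_time(
    node_time=[0, 0, 3, 5, 4],
    edge_parent=[3, 3, 4, 2],
    edge_child=[4, 2, 0, 1],
)
print(times, parents, children)
```

In a real tree sequence the edge table would need re-sorting afterwards too, since tskit requires edges to be ordered by parent time; `TableCollection.sort()` does that.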
A more justified, model-based approach to cutting up the root nodes is to implement the PSMC-on-the-tree idea for the root. If that is implemented, it's possible that we should use it to cut up the root nodes instead. So there's an argument for making the version above available only as a non-default post-process option.
On the basis that the ultimate ancestor is not biologically very plausible, in recent versions of tsinfer we now cut up edges that lead directly to the ultimate ancestor, by running the new post_process routine.
However, I suspect (and tests show) that we still make root ancestors that are too long. Therefore we could think about cutting up not just the ultimate ancestor, but also any root in which the edges-in or the edges-out change.
Here's some example code, with a histogram of actual edge spans of the root node. Note that this code may result in nodes that are not ordered strictly by time.
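As a rough illustration of the idea of cutting a root wherever its edges change, here is a sketch that finds those positions from plain edge arrays. It is not the code referred to above and does not use the tskit API; the helper `node_breakpoints` and the toy edge table are hypothetical.

```python
import numpy as np


def node_breakpoints(node, left, right, parent, child):
    """Sorted positions where any edge into or out of `node` starts or ends."""
    left, right = np.asarray(left), np.asarray(right)
    parent, child = np.asarray(parent), np.asarray(child)
    # Edges where `node` is the parent (edges-out) or the child (edges-in)
    touches = (parent == node) | (child == node)
    return np.unique(np.concatenate([left[touches], right[touches]]))


# Toy edge table: node 4 is a root over [0, 100) whose children change at 40 and 70
left = np.array([0, 0, 40, 70])
right = np.array([100, 40, 70, 100])
parent = np.array([4, 4, 4, 4])
child = np.array([0, 1, 2, 3])

breaks = node_breakpoints(4, left, right, parent, child)
spans = np.diff(breaks)  # span of each piece the root would be cut into
print(breaks, spans)
```

Each entry in `spans` would correspond to one new node after cutting, and the spans could be summarised with `plt.hist(spans)`.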