Open hyanwong opened 5 years ago
The fundamental problem is that, as a consequence of the tree sequence data format, you inevitably lose some of the edges when a recombination occurs, and this leads the recombination clock to underestimate node age. This problem can be managed in the n=2 case (this is how GEVA works), but our current thinking is that it is an intractable problem for n=3 and above... or maybe not! I think it'll be worth bringing this up again with Gil
There's an interesting application of this to the phasing singletons issue in #374. We could probably test (with real data) how much extra information there is in the haplotype lengths (i.e. the recombination clock) for phasing singletons, once the tsdated branch lengths are taken into account. The haplotype lengths were essentially what Platt et al. in Jody Hey's lab used for phasing singletons (the inspiration for Nate's singleton-phasing idea)
Since any analysis of singletons is only based on data present at the very tips of the trees, we would be testing the "best case" scenario, where we might expect the recombination clock to be pretty valuable for estimating dates.
Essentially this is like trying to predict the phase (mum or dad) from (a) the difference in tsdated times between the samples in the individual and the two different parent nodes and (b) the difference in span of the edge between the samples in the individual and the two different parent nodes.
We can then do a logistic regression of phase on (a) and (b). You might also expect the average branch length to play a role in the robustness of the model, so I have done an analysis in R with that added as a covariate, and also put in the interaction terms. I've used log dates and log spans too.
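As a rough illustration of this setup (toy data only, not the real tsinfer/tsdate pipeline: the covariate names, effect sizes and sample size here are all made up), the idea is a binomial GLM where both the age difference and the span difference carry phasing signal. A minimal Python sketch, fitting by Newton-Raphson:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
# Hypothetical covariates for each singleton:
difflogtime = rng.normal(size=n)  # log(age of parent A) - log(age of parent B)
logspandiff = rng.normal(size=n)  # log(edge span of A) - log(edge span of B)

# Assumed generative model: both covariates contribute to the true phase
logit = 1.5 * difflogtime + 0.8 * logspandiff
phase = rng.random(n) < 1 / (1 + np.exp(-logit))

# Fit the logistic regression by Newton-Raphson (equivalent to R's
# glm(..., family="binomial") without the interaction terms)
X = np.column_stack([np.ones(n), difflogtime, logspandiff])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    w = p * (1 - p)                    # IRLS weights
    grad = X.T @ (phase - p)           # score vector
    hess = (X * w[:, None]).T @ X      # observed information
    beta += np.linalg.solve(hess, grad)

print(beta)  # should recover roughly [0, 1.5, 0.8]
```

With real data the covariates would instead come from the tsdated node times and edge spans of the two candidate parent nodes, as described above.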
This is only a rough way to do the analysis. A better way would be to use the theoretically expected IBD lengths (as in Platt et al.) to calibrate the expected contribution of haplotype length to phasing. But I think it's good enough to get a feel for the relative magnitude of the contributions.
df <- read.csv("analysis_output.csv")
# Logistic regression of true phase on the (log) time difference, mean
# (log) time, and (log) span difference, with all interaction terms
model <- glm(TruePhase ~ difflogtime * meanlogtime * logspandiff, data=df, family="binomial")
anova(model, test="Chisq")  # sequential deviance table with chi-squared p-values
Gives:
Analysis of Deviance Table
Model: binomial, link: logit
Response: TruePhase
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 119207 165246
difflogtime 1 19540.4 119206 145705 < 2.2e-16 ***
meanlogtime 1 8.8 119205 145697 0.002975 **
logspandiff 1 5365.7 119204 140331 < 2.2e-16 ***
difflogtime:meanlogtime 1 778.3 119203 139552 < 2.2e-16 ***
difflogtime:logspandiff 1 1.6 119202 139551 0.204540
meanlogtime:logspandiff 1 199.3 119201 139352 < 2.2e-16 ***
difflogtime:meanlogtime:logspandiff 1 1.0 119200 139351 0.322044
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In other words, we can probably increase the predictive accuracy of the phasing algorithm by about 25% over what we currently have (deviance of logspandiff: 5365.7, compared to 19540.4 for difflogtime), by also incorporating the length of the haplotype into the calculation.
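Just as a sanity check of the arithmetic (not part of the original analysis), the "about 25%" comes from the ratio of the two sequential deviance reductions in the table above:

```python
# Deviance reduction attributable to logspandiff, relative to difflogtime,
# taken from the sequential ANOVA table
extra = 5365.7 / 19540.4
print(round(extra, 3))  # → 0.275, i.e. roughly a quarter extra
```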
This gives a (very) rough idea of how much extra information we have in the recombination clock in the best-case situation of "recent" variants. However, as "recent" is relative to the number of samples in the dataset, I suspect that haplotype length will provide more information than this as the number of samples increases to tens or hundreds of thousands (we should test this when we have the GEL data).
It would be easy to test using simulations how this proportion changes as the number of samples changes. Also, FWIW, here's the ANOVA table when the span-difference term is added first. It looks like there is about the same amount of information (actually a tad more) in the haplotype length as we are currently using in the node age differences. But (as expected) the majority of the information is shared between haplotype length and tsdate-estimated time.
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 119207 165246
spandiff 1 20991.6 119206 144254 < 2.2e-16 ***
difflogtime 1 3886.5 119205 140368 < 2.2e-16 ***
meanlogtime 1 7.0 119204 140361 0.008256 **
spandiff:difflogtime 1 4.1 119203 140357 0.043326 *
spandiff:meanlogtime 1 400.4 119202 139956 < 2.2e-16 ***
difflogtime:meanlogtime 1 2.5 119201 139954 0.113650
spandiff:difflogtime:meanlogtime 1 15.4 119200 139938 8.809e-05 ***
So if I simplify down to the first 100 samples, it does indeed seem (as I would predict) that the extra information provided by the recombination clock decreases (in this case to a 3% additional deviance reduction). I would hope that with e.g. the GEL or UKBB data we would see the opposite: the haplotype length would provide quite a lot of extra useful information on branch lengths / singleton phasing that we aren't currently getting from tsdate.
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 16623 23045
difflogtime 1 9289.8 16622 13755 < 2.2e-16 ***
meanlogtime 1 0.4 16621 13755 0.53894
spandiff 1 301.6 16620 13453 < 2.2e-16 ***
difflogtime:meanlogtime 1 32.6 16619 13421 1.143e-08 ***
difflogtime:spandiff 1 3.8 16618 13417 0.05229 .
meanlogtime:spandiff 1 25.5 16617 13392 4.455e-07 ***
difflogtime:meanlogtime:spandiff 1 4.2 16616 13387 0.03971 *
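The same back-of-envelope calculation as before, applied to the 100-sample table above, gives the ~3% figure quoted:

```python
# Deviance reduction from spandiff relative to difflogtime in the
# 100-sample subset (values from the sequential ANOVA table)
extra = 301.6 / 9289.8
print(round(extra, 3))  # → 0.032, i.e. ~3% additional reduction
```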
Just adding this (obvious) issue so that we have somewhere to keep track of this discussion. Perhaps @awohns can add a little explanation of what the critical sticking points are?