Incorporate recombination clock

hyanwong commented 5 years ago

Just adding this (obvious) issue so that we have somewhere to keep track of this discussion on this. Perhaps @awohns can add a little explanation about what the critical sticking points are?

awohns commented 5 years ago

The fundamental problem is that a consequence of the tree sequence data format is that you inevitably lose some of the edges when a recombination occurs and this leads the recombination clock to underestimate node age. This problem can be managed in the n=2 case (this is how GEVA works), but our current thinking is that it is an intractable problem in n=3 and above... or maybe not! I think it'll be worth bringing this up again with Gil

hyanwong commented 3 months ago

There's an interesting application of this to the phasing singletons issue in #374. We could probably test (with real data), how much extra information there is in the haplotype lengths (i.e. the recombination clock) for phasing singletons, once the tsdated branch lengths are taken into account. The haplotype lengths were essentially what was used by Platt et al in Jody Hey's lab for phasing singletons (the inspiration for Nate's singleton phasing idea)

Since any analysis of singletons is only based on data present at the very tips of the trees, we would be testing the "best case" scenario, where we might expect the recombination clock to be pretty valuable for estimating dates.

Essentially this is like trying to predict the phase (mum or dad) from (a) the difference in tsdated times between the samples in the individual and the two different parent nodes and (b) the difference in span of the edge between the samples in the individual and the two different parent nodes.

We can then do a logistic regression of phase on (a) and (b). You might also expect the average branch length to play a role in the robustness of the model, so I have done an analysis in R with that added that as a covariate, and also put in the interaction terms. I've used log dates and log spans too.

This is only a rough way to do the analysis. A better way would be to use the theoretically expected IBD lengths (like in Platt et al) to calibrate the expected contribution of haplotype length on phasing. But I think it's good enough to get a feel for the relative magnitude of contributions.

df <- read.csv("analysis_output.csv")
model <- glm(TruePhase ~ difflogtime * meanlogtime * logspandiff, data=df, family="binomial")
anova(model)

Gives:

Analysis of Deviance Table

Model: binomial, link: logit

Response: TruePhase

Terms added sequentially (first to last)

                                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                               119207     165246              
difflogtime                          1  19540.4    119206     145705 < 2.2e-16 ***
meanlogtime                          1      8.8    119205     145697  0.002975 ** 
logspandiff                          1   5365.7    119204     140331 < 2.2e-16 ***
difflogtime:meanlogtime              1    778.3    119203     139552 < 2.2e-16 ***
difflogtime:logspandiff              1      1.6    119202     139551  0.204540    
meanlogtime:logspandiff              1    199.3    119201     139352 < 2.2e-16 ***
difflogtime:meanlogtime:logspandiff  1      1.0    119200     139351  0.322044    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In other words, we can probably increase the predictive accuracy of the phasing algorithm by about 25% more than we currently have (deviance of logspandiff: 5365.7 compared to that of difflogtime: 19540.4), by also incorporating the length of the haplotype into the calculation.

This gives a (very) rough idea of how much extra information we have in the recombination clock in the best-case situation of "recent" variants. However, as "recent" is relative to the number of samples in the dataset, I suspect that haplotype length will provide more information than this as the number of samples increases to tens or hundreds of thousands (we should test this when we have the GEL data).

Code for this example

```python import tszip import tsdate # file from https://github.com/tskit-dev/tsdate/issues/374#issuecomment-2176188312 ts = tszip.decompress("1kgp_bal1500_phased_singletons.trees.tsz") dts = tsdate.date(ts, mutation_rate=1.29e-8, singletons_phased=False, progress=True) # Identify singletons: def id_singletons(correct_phase_ts, dts, combine_edges=False): """ combine_edges=True is slower and uses the combined span of all edges with the same parent/child combination in the spandiff and spanmean values. This seesm to make the fit worse, so not worth doing """ assert np.all(correct_phase_ts.breakpoints(as_array=True) == dts.breakpoints(as_array=True)) assert correct_phase_ts.num_nodes == dts.num_nodes assert np.all(correct_phase_ts.samples() == dts.samples()) samples = set(dts.samples()) data = [] for cp_tree, tree in zip(correct_phase_ts.trees(), dts.trees()): if tree.num_edges == 0: continue for cp_site, site in zip(cp_tree.sites(), tree.sites()): if len(cp_site.mutations) != 1: continue if cp_site.mutations[0].node not in samples: continue cp_indiv = correct_phase_ts.nodes_individual[cp_site.mutations[0].node] indiv = dts.nodes_individual[site.mutations[0].node] assert cp_indiv == indiv if cp_indiv == tskit.NULL: continue cp_indiv = correct_phase_ts.individual(cp_indiv) nodes = np.array(cp_indiv.nodes) assert cp_site.mutations[0].node in nodes assert len(nodes) == 2 assert nodes[0] != nodes[1] # Find parents above each node in this position parents = np.array([cp_tree.parent(u) for u in nodes]) if combine_edges: edgespans = [] for u, p in zip(nodes, parents): use = np.logical_and(correct_phase_ts.edges_child == u, correct_phase_ts.edges_parent == p) edgespans.append(np.sum(correct_phase_ts.edges_right[use] - correct_phase_ts.edges_left[use])) else: edges = [correct_phase_ts.edge(cp_tree.edge(u)) for u in nodes] for e, u, p in zip(edges, nodes, parents): assert e.parent == p assert e.child == u edgespans = [e.right - e.left for e in edges] assert parents[0] != tskit.NULL times = dts.nodes_time[parents] difftime = np.diff(times)[0] meantime = np.mean(times) spandiff = np.diff(edgespans)[0] spanmean = np.mean(edgespans) difflogtime = np.diff(np.log(times))[0] meanlogtime = np.mean(np.log(times)) logspandiff = np.diff(np.log(edgespans))[0] logspanmean = np.mean(np.log(edgespans)) val = int((nodes[0] == cp_site.mutations[0].node)) data.append([val, difftime, meantime, spandiff, spanmean, difflogtime, meanlogtime, logspandiff, logspanmean]) return data data = id_singletons(ts, dts) import pandas as pd df = pd.DataFrame( data, columns=["TruePhase", "difftime", "meantime", "spandiff", "spanmean", "difflogtime", "meanlogtime", "logspandiff", "logspanmean"] ) df.to_csv("analysis_output.csv") ```

hyanwong commented 3 months ago

It would be easy to test using simulations how this proportion is changed as the number of samples changes. Also, FWIW, here's the ANOVA table when logspandiff is first. So it looks like there is about the same amount of information (actually a tad more) in the haplotype length as we are using in the node age differences. But (as expected), the majority of information is shared between haplotype length and tsdate-estimated time..

                                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                            119207     165246              
spandiff                          1  20991.6    119206     144254 < 2.2e-16 ***
difflogtime                       1   3886.5    119205     140368 < 2.2e-16 ***
meanlogtime                       1      7.0    119204     140361  0.008256 ** 
spandiff:difflogtime              1      4.1    119203     140357  0.043326 *  
spandiff:meanlogtime              1    400.4    119202     139956 < 2.2e-16 ***
difflogtime:meanlogtime           1      2.5    119201     139954  0.113650    
spandiff:difflogtime:meanlogtime  1     15.4    119200     139938 8.809e-05 ***

hyanwong commented 3 months ago

It would be easy to test using simulations how this proportion is changed as the number of samples changes

So if I simplify down to the first 100 samples, it indeed seems to be (as I would predict) that the extra information provided by the recombination clock decreases (in this case to 3% additional deviance reduction). I would hope that with e.g. the GEL or UKBB data, we would see the opposite, in that the haplotype length would provide quite a lot of extra useful information on branch lengths / singleton phasing that we aren't currently getting from tsdate.

                                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                             16623      23045              
difflogtime                       1   9289.8     16622      13755 < 2.2e-16 ***
meanlogtime                       1      0.4     16621      13755   0.53894    
spandiff                          1    301.6     16620      13453 < 2.2e-16 ***
difflogtime:meanlogtime           1     32.6     16619      13421 1.143e-08 ***
difflogtime:spandiff              1      3.8     16618      13417   0.05229 .  
meanlogtime:spandiff              1     25.5     16617      13392 4.455e-07 ***
difflogtime:meanlogtime:spandiff  1      4.2     16616      13387   0.03971 *

tskit-dev / tsdate

Incorporate recombination clock #5