tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 70 forks source link

add statistic to compute density of coalescences through time #1315

Open petrelharp opened 3 years ago

petrelharp commented 3 years ago

This is also used to look for selection. See also #1314 - maybe this is related, as each coalescence is the end of a linage.

nspope commented 3 years ago

I have been doing this with topological constraints -- e.g. calculating density of coalescence events between two tips, conditional on neither first coalescing with an outgroup -- and this gives rise to a set of statistics that are informative about time-varying migration rates. If this twist is of interest and is not already in progress, would be happy to put together a PR.

petrelharp commented 3 years ago

That'd be excellent. That sounds very interesting and innovative!

nspope commented 3 years ago

Great, then I'll take a crack at it & will try to model after the time-varying GNN API.

This is tangential but provides some context: the application of these statistics is in the spirit of RELATE's way of calculating effective population size (e.g. simply inverting pairwise coalescence density to get Ne), but extended to the multi-deme case. Say we have two demes, A and B, then there are six possible topologies of three (labelled) tips: ((A,A),B), ((A,B),B), etc. Under a piecewise-constant demography where Ne/migration rates vary over epochs, the predicted density of first coalesence events for each possible topology is solvable via standard methods for continuous-time Markov processes. These can be fit to the observed density (via a multinomial pmf) to get estimates of Ne/migration matrix per epoch. It turns out that this is a sort of nonlinear state-space model, so is well-suited for particle filters, etc.

petrelharp commented 2 years ago

We discussed this in the meeting today and got enthusiastic about it, especially @janaobsteter, @gregorgorjanc, @awohns.

Also, Chuck Langley suggested providing an easy way of telling if the data are consistent with a single-Ne model (for instance; plot within- and between-coalescence rates).

petrelharp commented 2 years ago

We might get moving on this soon, @nspope? What's your status here?

nspope commented 2 years ago

Hi @petrelharp, I have a python mock-up that I can put into a PR when I have a minute later this week. The implementation is straightforward for arbitrary pairs of tips -- e.g. no topological constraints, such as conditioning on neither tip first coalescing with an outgroup -- but gets complicated when trying to add arbitrary topological constraints.

Anyway it should serve to get the conversation going.

gregorgorjanc commented 2 years ago

This is tangential but provides some context: the application of these statistics is in the spirit of RELATE's way of calculating effective population size (e.g. simply inverting pairwise coalescence density to get Ne), but extended to the multi-deme case. Say we have two demes, A and B, then there are six possible topologies of three (labelled) tips: ((A,A),B), ((A,B),B), etc. Under a piecewise-constant demography where Ne/migration rates vary over epochs, the predicted density of first coalesence events for each possible topology is solvable via standard methods for continuous-time Markov processes. These can be fit to the observed density (via a multinomial pmf) to get estimates of Ne/migration matrix per epoch. It turns out that this is a sort of nonlinear state-space model, so is well-suited for particle filters, etc.

@nspope we are exactly after this - inverting pairwise coalescence density to get Ne over time. The data can often comprise of multiple "potential" groups/demes/clades so your idea is indeed interesting! But perhaps we should first do the simplest possible implementation encompassing the whole tree sequence?

nspope commented 2 years ago

@gregorgorjanc I agree, I think we can start with a simple, general implementation and go from there. I'm happy to put that together based on what I already have, which was for the (more specific) application I described.

gregorgorjanc commented 2 years ago

@nspope brilliant! @janaobsteter is happy to run tests on the honey bee tree sequence that she is working with:)

For completeness, msprime has a model debugging method coalescence_rate_trajectory() https://github.com/tskit-dev/msprime/blob/main/msprime/demography.py, but that gives coalescence rate according to a model, not simulated data (or ideally from inferred tree sequence, which is the aim of this issue).

petrelharp commented 2 years ago

There's going to be various ways to do this, and we'll probably want to start a new thread or so (issue? pr?) in which to put down some proposals, but I'll record this here for now:

The simplest summary (I think?) of coalescence rate is to let T denote the empirical distribution of coalescence times (across loci and pairs of samples), and compute an estimate of - (d/dt) log P(T > t). We can get an empirical estimate of P(T > t) just using the ECDF, and then the estimate of the rate in a time bin is something like the proportion of pairs of lineages that go into the time bin that come out the other side, divided by the length of the time interval (I think?). This is very like using Kaplan-Meier curves in survival analysis, except that computing confidence intervals will take a bit of thinking as the pairwise coalescence times are obviously not independent.

nspope commented 2 years ago

EDIT: I was wrong, for the reason described here.

My approach for confidence intervals has been to calculate counts of coalescent events in time windows separately for disjoint pairs of tips (so that each pair is independent). Then treat the total number of coalescent events within a time window as a Poisson process with rate d_e * p_e/N_e where d_e is the duration of window e, N_e is the effective population size in that window, and p_e is the number of independent pairs that haven't coalesced by the beginning of the window. The MLE & std error are straightforward in this case. This is justified when there's a bunch of independent trees because the coalescent is a Markov process (so its associated counting process is a Poisson process).

To get valid confidence intervals using all possible pairs of tips, maybe resampling would work ... Like calculating counts of coalescent events through time separately for all pairs, then bootstrapping by sampling these "count trajectories" with replacement? It seems really challenging to come up with a bootstrapping scheme that would replicate the dependence between pairs.

Likewise, I don't know of a way to incorporate dependence between nearby trees without resampling. Such as calculating these statistics across disjoint intervals of the tree sequence, and bootstrapping these intervals.