hyanwong opened this issue 1 year ago
We can attempt to infer a variable Ne by performing a tsdate inference, assessing the coalescence distribution over time (possibly broken up by population, if necessary), then using a rescaled set of timepoints to reflect variable Ne (see #228), and performing another tsdate inference. We could continue to iterate like this until the timepoint scaling equilibrates.

There are a couple of issues here. The one that first strikes me is whether, in the second round of tsdate usage, we (a) set the prior distribution for each node to be the posterior of the initial inference, or (b) reset the priors to the conditional coalescent but simply rescale the times using the newly assessed Ne mapping. My inclination is to do (b): we know that the posterior times are sometimes a bit off, so using them as a prior risks feeding tsdate-related errors back into the system, amplifying any tsdate issues.

I wonder if @awohns and @nspope have any thoughts here?
> (possibly broken up by population, if necessary)
It may be worth thinking about how we would allocate different Ne transformations to different populations. Since this only affects the prior, we could in principle allocate a different transformation to each node if we wanted, without worrying about topological interactions between the nodes.
I think Relate allows a separate Ne for each population (and it's used quite a lot), so it might be worth thinking about how this could be done. Maybe the first thing is to work out how we would assign nodes to populations.
I wonder if it's also worth thinking about what we would do in (semi) continuous space, where we can't clearly assign samples to discrete populations.
The reason why allocating nodes to populations works is that a population is usually large enough to contain a number of coalescence points during each time slice. In tsdate terms, we want to allocate a prior distribution to a specific ancestor, and we can use the density of coalescence points within the population in which that ancestor resides as a measure of population size, scaling that distribution to fit the demography. In other words, we can average coalescent density among all members of a subpopulation.
We could try extending that to a continuous population by other sorts of averaging. For instance, we could allocate (rough) geographical locations to ancestors, in the way we did in the unified genealogy paper. We could then have some function that looks at the average coalescent density around a target ancestor, taking all coalescence points from the same time slice but weighting each by the inverse squared distance between that coalescence and the target ancestral lineage(s) at that time point.
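As a rough illustration, here is a minimal sketch of that weighting scheme. Everything in it is hypothetical: it assumes we already have arrays of coalescence times and 2D locations (e.g. from a previous round of inference), and simply computes a distance-weighted coalescence density per time slice for a target location.

```python
import numpy as np

def weighted_coalescence_density(coal_times, coal_locations, target_location,
                                 time_breaks, epsilon=1e-8):
    """
    Inverse-squared-distance weighted density of coalescence events per
    time slice, from the perspective of an ancestor at `target_location`.

    coal_times: (n,) array of coalescence event times
    coal_locations: (n, 2) array of event locations
    target_location: (2,) location of the target ancestral lineage
    time_breaks: (k+1,) boundaries of the time slices
    """
    # inverse squared distance of each event from the target lineage
    sq_dist = np.sum((coal_locations - target_location) ** 2, axis=1)
    weights = 1.0 / (sq_dist + epsilon)  # epsilon avoids division by zero
    # weighted count of events in each time slice, normalised by total weight
    density, _ = np.histogram(coal_times, bins=time_breaks, weights=weights)
    return density / weights.sum()
```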
> There are a couple of issues here. The one that first strikes me is whether, in the second round of tsdate usage, we (a) set the prior distribution for each node to be the posterior of the initial inference, or (b) reset the priors to the conditional coalescent but simply rescale the times using the newly assessed Ne mapping. My inclination is to do (b).
Yes, I agree that (b) is the way to go. It's sort of an empirical Bayes procedure: it works because the "prior" for each node is calculated using global information (which is robust to per-node uncertainty). With (a), I suspect the posterior for each node would concentrate on a single time slice as the number of iterations increases.
> I think Relate allows a separate Ne for each population (and it's used quite a lot), so it might be worth thinking about how this could be done. Maybe the first thing is to work out how we would assign nodes to populations.
I may be misremembering, but I think that while Relate will calculate average pairwise coalescence rates between populations, it does not use these in the prior for branch lengths. Instead, I believe this is done via a "global" pairwise coalescence rate on a logarithmic grid, that is updated with each iteration.
It's definitely worth thinking about how to incorporate discrete/continuous locations, though. Perhaps also genomic coordinate (I'm thinking about large SVs).
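For concreteness, a logarithmic time grid of the kind described is easy to set up; the bounds and resolution below are arbitrary illustrative choices, not Relate's actual settings.

```python
import numpy as np

# logarithmic grid of time breakpoints, e.g. ~10 to ~1e6 generations;
# a "global" pairwise coalescence rate would be estimated within each
# interval and updated at each iteration
log_grid = np.logspace(1, 6, num=50)
time_breaks = np.concatenate([[0.0], log_grid, [np.inf]])
```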
For the calculation of Ne through time (at least in the single-population case) at each iteration, we can use `TreeSequence.coalescence_time_distribution` to calculate rates of pairwise coalescence events in time slices. This is basically what Relate does.
So it may be worth allowing the population-size input to the tsdate prior to be any of: (a) a single number; (b) a 2D array of epoch start times and population sizes; or (c) a callable.
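A minimal sketch of how such a flexible input could be normalised internally, mapping all three forms onto a single `time -> Ne` callable; the function name and conventions here are hypothetical, not part of the current tsdate API.

```python
import numpy as np

def as_ne_function(population_size):
    """
    Normalise a population-size specification to a callable time -> Ne.
    Accepts a scalar, a 2D array of (epoch_start, Ne) rows, or a callable.
    """
    if callable(population_size):
        return population_size
    arr = np.atleast_2d(np.asarray(population_size, dtype=float))
    if arr.size == 1:  # single constant Ne
        ne = float(arr.flat[0])
        return lambda t: np.full_like(np.asarray(t, dtype=float), ne)
    starts, sizes = arr[:, 0], arr[:, 1]  # piecewise-constant epochs
    assert np.all(np.diff(starts) > 0), "epoch starts must be increasing"
    def ne_at(t):
        idx = np.searchsorted(starts, t, side="right") - 1
        return sizes[np.clip(idx, 0, len(sizes) - 1)]
    return ne_at
```

The callable form then gives downstream code a single path for evaluating Ne at arbitrary node times.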
> Yes, I agree that (b) is the way to go. It's sort of an empirical Bayes procedure: it works because the "prior" for each node is calculated using global information (which is robust to per-node uncertainty). With (a), I suspect the posterior for each node would concentrate on a single time slice as the number of iterations increases.
Excellent, glad we are agreed.
> For the calculation of Ne through time (at least in the single-population case) at each iteration, we can use `TreeSequence.coalescence_time_distribution` to calculate rates of pairwise coalescence events in time slices. This is basically what Relate does.
I think we should plan for a case where we can set a separate function to scale the prior for each node. I'm not sure about cross-coalescence, though.
@nspope - have you thought about what we do for node metadata when iterating (see https://github.com/tskit-dev/tsdate/issues/203#issuecomment-1438679651)? Specifically, tsdate currently stores the mean and variance of the estimates of node time. So should we have a parameter to `.date()` that says `store_metadata=True`, which we set to False for the first few rounds of iteration? Or do we store metadata every round of iteration and simply stomp on the previously saved values? Or do we have a parameter to `.date()` that says "stomp on existing tsdate metadata if it already exists"?
I haven't thought about it, but an `overwrite_metadata` option sounds like a good idea. I guess the concern is that wiping previous metadata could delete non-tsdate stuff?

With the iterative approach, I'm still not sure how this should be incorporated into the API. Currently, I've been demoing the approach with a function that repeatedly calls `date`. But I guess it could also be an option to `date`, in which case we could just hold off adding the metadata until the last iteration. I need to finish https://github.com/tskit-dev/tskit/pull/2621 before moving forward on that, though.
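To make the trade-off concrete, here is a hedged sketch of what an `overwrite_metadata` guard could look like when writing per-node summaries back into a table collection. The function, the parameter name, and the `"mn"`/`"vr"` keys are illustrative assumptions rather than tsdate's actual implementation.

```python
import json
import tskit

def write_node_times_metadata(tables, means, variances, overwrite_metadata=False):
    """
    Store per-node mean ("mn") and variance ("vr") of inferred node times in
    node metadata, refusing to clobber existing metadata unless asked.
    """
    nodes = tables.nodes
    old_metadata = tskit.unpack_bytes(nodes.metadata, nodes.metadata_offset)
    new_metadata = []
    for node_id, old in enumerate(old_metadata):
        if old and not overwrite_metadata:
            # keep whatever is there (it may be non-tsdate metadata)
            new_metadata.append(old)
        else:
            record = {"mn": means[node_id], "vr": variances[node_id]}
            new_metadata.append(json.dumps(record).encode())
    data, offset = tskit.pack_bytes(new_metadata)
    nodes.set_columns(
        flags=nodes.flags, time=nodes.time, population=nodes.population,
        individual=nodes.individual, metadata=data, metadata_offset=offset,
    )
```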
In humans, we expect some individuals (admixed ones especially) to have genomes that are a mixture of different "ancestries", with potentially quite different Ne values. We might be able to spot this by looking along the genome and identifying patterns of coalescence time distributions that are radically different from one region of the genome to another. Having a "base Ne" pattern might be helpful here. It would presumably be quite easy to test this using one of the stdpopsim admixture models.
I wonder if there is any statistical way to cluster genomic regions (or indeed ancestral nodes) like this? For example, we could plot the distribution of (coalescence-rescaled) times to the next coalescence for each (ancestral) node and see if it looks bimodal. Admixed individuals would then look like they had a mix of two sorts of regions with different time distributions. We might need to do this across many different chromosomes to get a sensible pattern, though.
There definitely is -- the easiest way is to augment with a latent variable per node that indicates cluster membership. Then the prior for each node is a mixture over components. Inferring the mixture weights can be done in the EP algorithm (or an EM-like update could be used -- this would probably work with the original, time-discretized approach). Andy and I have been talking about this quite a lot.

The localized, marginal prior I'm working on does something similar, in that the genome is chopped up into windows and each window has its own time scaling. It'd be pretty easy to turn this into an unsupervised mixture (e.g. drop the a priori windowing).
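As a toy illustration of the latent-cluster idea, here is a minimal two-component EM sketch over (log) node times. It is a standalone Gaussian mixture standing in for the EP-integrated mixture prior described above; nothing in it is tsdate code.

```python
import numpy as np

def em_two_cluster(log_times, num_iter=50):
    """
    Toy EM for a two-component Gaussian mixture over log node times.
    Returns per-node responsibilities (soft cluster memberships) and
    the fitted mean, standard deviation, and weight of each component.
    """
    mu = np.quantile(log_times, [0.25, 0.75])  # crude initialisation
    sd = np.full(2, log_times.std())
    pi = np.full(2, 0.5)
    for _ in range(num_iter):
        # E step: responsibility of each component for each node
        logp = (
            -0.5 * ((log_times[:, None] - mu) / sd) ** 2
            - np.log(sd) + np.log(pi)
        )
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: update weights, means, and standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(log_times)
        mu = (resp * log_times[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (log_times[:, None] - mu) ** 2).sum(axis=0) / nk)
    return resp, mu, sd, pi
```

Strongly bimodal responsibilities would then flag the kind of mixed-ancestry pattern discussed above.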
Just to note that @nspope has a nice example of using `ts.coalescence_time_distribution` in an iterative way at https://github.com/tskit-dev/tsdate/pull/251#issuecomment-1426637599
Here's an example:
```python
import numpy as np

# NB: assumed import path; this may differ between tsdate versions
from tsdate.demography import PopulationSizeHistory

def estimate_demography(ts, num_epochs):
    """
    Calculate a population size trajectory from pair coalescence rates
    """
    coaltime_distr = ts.coalescence_time_distribution(
        weight_func="pair_coalescence_events",
        span_normalise=True,
    )
    # epoch breakpoints at evenly spaced quantiles of pair coalescence times
    quantiles = np.linspace(0.0, 1.0, num_epochs + 1)
    breaks = np.sort(np.unique(coaltime_distr.quantile(quantiles).flatten()))
    breaks[0] = 0.0
    breaks[-1] = np.inf
    # remove epochs with no coalescence events
    nonempty = np.append([True], np.diff(coaltime_distr.ecdf(breaks).flatten()) > 0.0)
    breaks = breaks[nonempty]
    coalrate = coaltime_distr.coalescence_rate_in_intervals(breaks).flatten()
    # a pair coalescence rate of 1/(2N) per generation implies N = 0.5 / rate
    demog = PopulationSizeHistory(0.5 / coalrate, breaks[1:-1])
    return demog
```
Used like,
```python
import tsdate
# NB: assumed import paths; these may differ between tsdate versions
from tsdate.demography import PopulationSizeHistory
from tsdate.prior import MixturePrior

# initially estimate a constant Ne from site diversity (theta = 4 * Ne * mu)
prior = MixturePrior(ts)
n0 = 0.25 * ts.diversity(mode="site", span_normalise=True) / mutation_rate
demography0 = PopulationSizeHistory(n0)
ts0 = tsdate.date(ts, mutation_rate, priors=prior.make_discretized_prior(demography0))  # NB: use `make_parameter_grid` for `method='variational_gamma'`

# re-estimate a variable-Ne prior from coalescence rates in the dated tree sequence
demography1 = estimate_demography(ts0, num_epochs=25)
ts1 = tsdate.date(ts, mutation_rate, priors=prior.make_discretized_prior(demography1))
```
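Following the issue's suggestion of iterating until the timepoint scaling equilibrates, the two steps above could be wrapped in a simple fixed-point loop. This sketch reuses `estimate_demography`, `MixturePrior`, and `PopulationSizeHistory` from the examples above; the convergence test on node times is an illustrative choice, not something tsdate provides.

```python
import numpy as np

def iterate_dating(ts, mutation_rate, num_epochs=25, max_iter=10, rtol=0.01):
    """
    Alternate between dating and re-estimating the demography, stopping when
    node times stop moving (max relative change in log time below rtol).
    """
    prior = MixturePrior(ts)
    n0 = 0.25 * ts.diversity(mode="site", span_normalise=True) / mutation_rate
    demography = PopulationSizeHistory(n0)
    prev_times = None
    for _ in range(max_iter):
        dated = tsdate.date(
            ts, mutation_rate, priors=prior.make_discretized_prior(demography)
        )
        times = dated.tables.nodes.time
        if prev_times is not None:
            internal = times > 0  # sample nodes stay at time zero
            shift = np.max(np.abs(np.log(times[internal] / prev_times[internal])))
            if shift < rtol:
                break
        prev_times = times
        demography = estimate_demography(dated, num_epochs=num_epochs)
    return dated, demography
```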