Closed hyanwong closed 1 month ago
Hmm, looks like you use ts.coalescence_time_distribution
something like the implementation you describe here, but I'm not sure it's worth using the to make an example in the docs if the interface is going to change?
Ah, I see there's a recent commit that changes the (undocumented) API: https://github.com/tskit-dev/tskit/commit/e6483fc8cdb1efa7ea54eead6073a85b737315b5 - description of the new interface at https://github.com/tskit-dev/tskit/pull/2915
edit added the true rates calculated from the model w/ msprime.
The API is only partially implemented at this point (e.g. calculating number of coalescing pairs per node). Until then, here's an example that does the rest manually:
import stdpopsim
import numpy as np
import matplotlib.pyplot as plt
def rates_from_ecdf(weights, atoms, quantiles):
assert weights.size == atoms.size
assert np.all(weights > 0.0)
assert np.all(atoms > 0.0)
assert np.all(np.diff(atoms) >= 0.0)
assert quantiles[0] == 0.0 and quantiles[-1] == 1.0
assert quantiles.size > 2
# find interior quantiles
weights = np.append(0, weights)
atoms = np.append(0, atoms)
ecdf = np.cumsum(weights)
indices = np.searchsorted(ecdf, ecdf[-1] * quantiles[1:-1], side='left')
lower, upper = atoms[indices - 1], atoms[indices]
ecdfl, ecdfu = ecdf[indices - 1] / ecdf[-1], ecdf[indices] / ecdf[-1]
assert np.all(np.logical_and(quantiles[1:-1] > ecdfl, quantiles[1:-1] < ecdfu))
# interpolate ECDF
assert np.all(ecdfu - ecdfl > 0)
slope = (upper - lower) / (ecdfu - ecdfl)
breaks = np.append(0, lower + slope * (quantiles[1:-1] - ecdfl))
# calculate coalescence rates within intervals
# this uses a Kaplan-Meier-type censored estimator: https://github.com/tskit-dev/tskit/pull/2119
coalrate = np.full(quantiles.size - 1, np.nan)
propcoal = np.diff(quantiles[:-1]) / (1 - quantiles[:-2])
coalrate[:-1] = -np.log(1 - propcoal) / np.diff(breaks)
last = indices[-1]
coalrate[-1] = np.sum(weights[last:]) / np.dot(atoms[last:] - breaks[-1], weights[last:])
return coalrate, breaks
# simulate some data
engine = stdpopsim.get_engine("msprime")
homsap = stdpopsim.get_species("HomSap")
contig = homsap.get_contig("chr1", right=10e6)
demogr = homsap.get_demographic_model("Zigzag_1S14")
ts = engine.simulate(demographic_model=demogr, contig=contig, samples={"generic" : 100})
# calculate coalescence rate between equally-spaced quantiles of the pair coalescence time distribution
quantiles = np.linspace(0, 1, 100)
coal_pairs = ts.pair_coalescence_counts() # number of coalescing pairs per node
atoms = ts.nodes_time[coal_pairs > 0]
weights = coal_pairs[coal_pairs > 0]
order = np.argsort(atoms)
rates, breaks = rates_from_ecdf(weights[order], atoms[order], quantiles)
# expected coalescence rates from model, and true Ne (using msprime)
debugger = demogr.model.debug()
true_rates = debugger.coalescence_rate_trajectory(steps=breaks, lineages={"generic":2})[0]
finescale = np.linspace(0, max(breaks), 1000)
true_ne = 0.5 / debugger.coalescence_rate_trajectory(steps=finescale, lineages={"generic":2})[0]
plt.step(breaks, 0.5 / rates, label='empirical quantiles')
plt.step(breaks, 0.5 / true_rates, label='theoretical quantiles')
plt.step(finescale, true_ne, label='model')
plt.xlabel("Generations in past")
plt.ylabel("Ne (diploid)")
plt.xscale("log")
plt.legend()
# note that the quantile method, while having minimal variance per epoch, averages over recent
# time (as there's not many coalescence events)
I should mention, I don't think this approach will give very usable estimates on tsinfer'd tree sequences. And, I'd rather not encourage people to use this API-- the variational gamma + mutation rescaling approach should be way more robust to population size changes, with no fiddling on the part of the user. So do we even need an example here?
This is a fair point. I have commented out that bit of the docs and will close this.
Nevertheless, some examples of utility for the dates of nodes (as well as / instead of mutations) might be nice to illustrate.
I've left a space in the docs (four paragraphs down at https://tskit.dev/tsdate/docs/latest/popsize.html) to place a population-size-over-time plot. I think this will involve plotting the IICR over time and comparing it to the demography object using to create the simulation.
I think you know how to make plots like this, right @nspope ? Do you have any tips?