tskit-dev / tsdate

Infer the age of ancestral nodes in a tree sequence.
MIT License
18 stars 7 forks source link

Add nice plot of inverse instantaneous coalescent rate to docs #399

Closed hyanwong closed 1 month ago

hyanwong commented 1 month ago

I've left a space in the docs (four paragraphs down at https://tskit.dev/tsdate/docs/latest/popsize.html) to place a population-size-over-time plot. I think this will involve plotting the IICR over time and comparing it to the demography object using to create the simulation.

I think you know how to make plots like this, right @nspope ? Do you have any tips?

hyanwong commented 1 month ago

Hmm, looks like you use ts.coalescence_time_distribution something like the implementation you describe here, but I'm not sure it's worth using the to make an example in the docs if the interface is going to change?

hyanwong commented 1 month ago

Ah, I see there's a recent commit that changes the (undocumented) API: https://github.com/tskit-dev/tskit/commit/e6483fc8cdb1efa7ea54eead6073a85b737315b5 - description of the new interface at https://github.com/tskit-dev/tskit/pull/2915

nspope commented 1 month ago

edit added the true rates calculated from the model w/ msprime.

The API is only partially implemented at this point (e.g. calculating number of coalescing pairs per node). Until then, here's an example that does the rest manually:

import stdpopsim
import numpy as np
import matplotlib.pyplot as plt

def rates_from_ecdf(weights, atoms, quantiles):
    assert weights.size == atoms.size
    assert np.all(weights > 0.0)
    assert np.all(atoms > 0.0)
    assert np.all(np.diff(atoms) >= 0.0)
    assert quantiles[0] == 0.0 and quantiles[-1] == 1.0
    assert quantiles.size > 2

    # find interior quantiles
    weights = np.append(0, weights)
    atoms = np.append(0, atoms)
    ecdf = np.cumsum(weights)
    indices = np.searchsorted(ecdf, ecdf[-1] * quantiles[1:-1], side='left')
    lower, upper = atoms[indices - 1], atoms[indices]
    ecdfl, ecdfu = ecdf[indices - 1] / ecdf[-1], ecdf[indices] / ecdf[-1]
    assert np.all(np.logical_and(quantiles[1:-1] > ecdfl,  quantiles[1:-1] < ecdfu))

    # interpolate ECDF
    assert np.all(ecdfu - ecdfl > 0)
    slope = (upper - lower) / (ecdfu - ecdfl)
    breaks = np.append(0, lower + slope * (quantiles[1:-1] - ecdfl))

    # calculate coalescence rates within intervals
    # this uses a Kaplan-Meier-type censored estimator: https://github.com/tskit-dev/tskit/pull/2119
    coalrate = np.full(quantiles.size - 1, np.nan)
    propcoal = np.diff(quantiles[:-1]) / (1 - quantiles[:-2])
    coalrate[:-1] = -np.log(1 - propcoal) / np.diff(breaks)
    last = indices[-1]
    coalrate[-1] = np.sum(weights[last:]) / np.dot(atoms[last:] - breaks[-1], weights[last:]) 

    return coalrate, breaks

# simulate some data
engine = stdpopsim.get_engine("msprime")
homsap = stdpopsim.get_species("HomSap")
contig = homsap.get_contig("chr1", right=10e6)
demogr = homsap.get_demographic_model("Zigzag_1S14")
ts = engine.simulate(demographic_model=demogr, contig=contig, samples={"generic" : 100})

# calculate coalescence rate between equally-spaced quantiles of the pair coalescence time distribution
quantiles = np.linspace(0, 1, 100)
coal_pairs = ts.pair_coalescence_counts()  # number of coalescing pairs per node
atoms = ts.nodes_time[coal_pairs > 0]
weights = coal_pairs[coal_pairs > 0]
order = np.argsort(atoms)
rates, breaks = rates_from_ecdf(weights[order], atoms[order], quantiles)

# expected coalescence rates from model, and true Ne (using msprime)
debugger = demogr.model.debug()
true_rates = debugger.coalescence_rate_trajectory(steps=breaks, lineages={"generic":2})[0]
finescale = np.linspace(0, max(breaks), 1000)
true_ne = 0.5 / debugger.coalescence_rate_trajectory(steps=finescale, lineages={"generic":2})[0]

plt.step(breaks, 0.5 / rates, label='empirical quantiles')
plt.step(breaks, 0.5 / true_rates, label='theoretical quantiles')
plt.step(finescale, true_ne, label='model')
plt.xlabel("Generations in past")
plt.ylabel("Ne (diploid)")
plt.xscale("log")
plt.legend()

# note that the quantile method, while having minimal variance per epoch, averages over recent
# time (as there's not many coalescence events)

foo

nspope commented 1 month ago

I should mention, I don't think this approach will give very usable estimates on tsinfer'd tree sequences. And, I'd rather not encourage people to use this API-- the variational gamma + mutation rescaling approach should be way more robust to population size changes, with no fiddling on the part of the user. So do we even need an example here?

hyanwong commented 1 month ago

This is a fair point. I have commented out that bit of the docs and will close this.

Nevertheless, some examples of utility for the dates of nodes (as well as / instead of mutations) might be nice to illustrate.