Closed Luker121 closed 1 month ago
Hello! Good question; the short answer is: it's a work in (active) progress; see for instance https://github.com/tskit-dev/tskit/pull/2985 .
However, what you're calculating is (I think) somewhat different - for instance, I don't think that the mean
and variance
are the mean and variance of the same distribution - not for instance, that mean
depends on span and variance
does not. The quantiles are, I think, quantiles of the distribution that's something like "pick a random tree and a random pair of individuals and find their inverse TMRCA"; this is different than "pick a random location on the genome and do the same thing". I have not thought through this in detail, but I wouldn't have called those inverse TMRCAs "rates" in this case? However, I may well be missing something.
I'll close this, but re-open if you like.
Hi, thanks for your fast reply. I appreciate the explanation on the distributions for the mean and variance. Here’s what I would do to get the weighted mean and weighted variance of my "inverse TMRCA":
For example:
# Span-weighted mean
mean_tmrca = np.sum(np.array(pairwise_tmrcas) * np.array(weights)) / np.sum(weights)
# Span-weighted variance
variance_tmrca = np.sum(weights * (np.array(pairwise_tmrcas) - mean_tmrca)**2) / np.sum(weights)
This way, the variance is based on the same span-weighted distribution as the mean. But this is not really what is meant with pairwise coalescent rates, right? It would rather be an inverse pairwise TMRCA as I am inverting coalescent times rather than calculating coalescent rates? In any case I am very much looking forward to a build-in function to extract (pairwise) coalescent rates from .trees files. :)
It looks like you're estimating (negative) moments of the pair coalescence time distribution, not coalescence rates. In either case, the way to do it more efficiently is to consider a weighted sum over nodes (where the weights are proportional to the average number of pairs coalescing at the node) rather than a weighted sum over pairs of samples per tree.
If it is actually pair coalescence rates you're after, these are pretty easily calculated from the empirical CDF (cumulative sum of weights with increasing node time). As @petrelharp says, these are in progress and should make it into the development version of tskit pretty shortly. If you wanted to work them out yourself (using node weights calculated in code snippet below), there's a description here.
However assuming that you're actually after negative moments, it'd be good to step back and ask if these actually exist. For example, let's say we're in an equilibrium population with haploid size $N$. Then under the coalescent model the distribution of time to coalescence $t$ of a pair is exponential with rate $1 / N$. In this case, $\mathbb{E}[t^{-1}] = \mathbb{E}[t^{-2}] = \infty$ (e.g. look at moments of an inverse gamma). So it's not clear to me that your weighted sum will converge to a meaningful expectation no matter how long the sequence.
But, let's say you wanted to calculate moments of a functional, that does exist (for example $\mathbb{E}[\log t]$, $\mathbb{V}[\log t]$, etc. In this case, you could calculate the node weights efficiently with the development version of tskit (installed from github),
import msprime
import numpy as np
ts = msprime.sim_ancestry(1000, population_size=1e4, recombination_rate=1e-8, sequence_length=5e7, random_seed=1)
# there is also a "windows" argument to this method
weights = ts.pair_coalescence_counts(pair_normalise=True, span_normalise=True, time_windows="nodes")
values = ts.nodes_time
# assume you want to estimate E[log(t)], V[log(t)] etc
weights = weights[values > 0]
values = np.log(values[values > 0])
mean = np.sum(weights * values) / np.sum(weights)
var = np.sum(weights * (values - mean) ** 2) / np.sum(weights)
As a sanity check, comparing against simulated iid pairs:
ts_pair = msprime.sim_ancestry(1, population_size=1e4, recombination_rate=0, sequence_length=1, num_replicates=10000, random_seed=2)
iid_values = np.array([ts_rep.nodes_time.max() for ts_rep in ts_pair])
mean_iid = np.mean(np.log(iid_values))
var_iid = np.var(np.log(iid_values), ddof=0)
print(f"iid mean {mean_iid}; full ts mean {mean}")
print(f"iid var {var_iid}; full ts var {var}")
# iid mean 9.302555147007228; full ts mean 9.327968582745994
# iid var 1.67762330135097; full ts var 1.6448586251316892
fwiw there's now a ts.pair_coalescence_rates(time_windows, ...)
method in the development version of tskit.
I'm encountering a segmentation fault while using the new pair_coalescence_rates function in the development version of tskit. The error occurs when I run my script to calculate pairwise coalescent rates over windows in a tree sequence. The error output suggests a bug in lib/tskit/trees.c at line 9800.
Environment
tskit version: `0.5.8.dev0`
Python version: `3.9`
Operating System: Fedora Linux 40
Error message:
Bug detected in lib/tskit/trees.c at line 9800. If you are using tskit directly please open an issue on GitHub, ideally with a reproducible example.
I have a SLiM-generated tree sequence .trees
file.
I'm using the pair_coalescence_rates
function with the following parameters:
time_windows=np.array([0, float('inf')])
sample_sets: 4 populations
indexes: itertools.combinations_with_replacement(range(len(populations)), 2)
windows: sequence windows of size 200kb
The error occurs during the second window of analysis (200,000–400,000
).
Here is a simplified version of my code to reproduce the error:
import tskit
import numpy as np
import itertools
# Load tree sequence
tree_seq = tskit.load("example.trees")
# Define populations (4 populations)
populations = [
list(range(0, 10)),
list(range(10, 25)),
list(range(25, 31)),
list(range(31, 37))
]
# Define indexes for pairwise coalescence
indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))
# Calculate coalescent rates
windows = [0, 200000, 400000]
pairwise_coal_rates = tree_seq.pair_coalescence_rates(
time_windows=np.array([0, float('inf')]),
sample_sets=populations,
indexes=indexes,
windows=windows
)
I've tried running on different tree sequence files, but the error persists.
Ah ha, thanks - could you provide the example.trees file or the SLiM script that produces it?
ok, thanks a lot for looking into this. I have a .trees file that spans a 2Mb region. And I am trying to go along that 2Mb region in 200kb windows and calculate the pairwise coalescent rate. It stops when trying to estimate the coalescent rate after the first window (0-200kb). Here is a more thorough version of my script:
import tskit
import numpy as np
import itertools
tree_seq = tskit.load("example.trees")
populations = [
list(range(0, 10)), # Population 1
list(range(10, 25)), # Population 2
list(range(25, 31)), # Population 3
list(range(31, 37)) # Population 4
]
# sample set indexes for pairwise coalescence
indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))
# windows along the genome
window_starts = np.arange(0, tree_seq.sequence_length, 200000)
window_ends = np.minimum(window_starts + 200000, tree_seq.sequence_length)
windows = np.column_stack((window_starts, window_ends))
# calculate pairwise coalescence rates and quantiles in windows
for window_index, (start, end) in enumerate(windows):
print(f"\nWindow {window_index + 1}: Start = {start}, End = {end}")
try:
pairwise_coal_rates = tree_seq.pair_coalescence_rates(
time_windows=np.array([0, 5000, 500000, float('inf')]),
sample_sets=populations,
indexes=indexes,
windows=[start, end]
)
quantiles = tree_seq.pair_coalescence_quantiles(
[0.05, 0.95],
sample_sets=populations,
indexes=indexes,
windows=[start, end]
)
print("Coalescent rates:", pairwise_coal_rates)
print("Quantiles:", quantiles)
except Exception as e:
print(f"Error in window {start}-{end}: {e}")
I attached the .trees
file as .zip
file.
example_tree.zip
Should we open an issue to track this specific bug and make sure we catch it before release?
I'm leaving that to @nspope - maybe he's got this already dealt with?
Great, thanks for trying it out and reporting this @Luker121. The issue is that the windows argument needs to be an increasing list of windows going from 0 to the sequence length-- just like with the other stats, you can't do isolated windows (for the moment anyway). For example, this works:
#!/usr/bin/env bash
import tskit
import numpy as np
import itertools
tree_seq = tskit.load("example.trees")
populations = [
list(range(0, 10)), # Population 1
list(range(10, 25)), # Population 2
list(range(25, 31)), # Population 3
list(range(31, 37)), # Population 4
]
# sample set indexes for pairwise coalescence
indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))
# windows along the genome
window_breaks = np.arange(0, tree_seq.sequence_length, 200000)
# calculate pairwise coalescence rates and quantiles in windows
pairwise_coal_rates = tree_seq.pair_coalescence_rates(
time_windows=np.array([0, 5000, 500000, float('inf')]),
sample_sets=populations,
indexes=indexes,
windows=window_breaks,
)
quantiles = tree_seq.pair_coalescence_quantiles(
[0.05, 0.95],
sample_sets=populations,
indexes=indexes,
windows=window_breaks,
)
in the outputs, pairwise_coal_rates[i]
would give you the output for the ith window.
It should definitely be erroring out with an informative message when you give it isolated windows, not segfaulting, so that's a bug!
Thanks a lot for your help, that’s working now. I have just one last question: Is there a way of getting some kind of CI or variance for the pairwise coalescent rate per time window?
Not a straightforward way at the moment, no. At the sequence level (or for a large enough window) you could block-bootstrap, which basically involves subdividing the genomic interval into blocks; using ts.pair_coalescence_counts to get weights per time interval per block; then resampling blocks, summing weights per time window over blocks, and using the formulas in the docstring for ts.pair_coaelscence_rates to get a bootstrap replicate for the weights. But I'm not sure how well that'd work with smaller windows.
While there are certainly ways to get confidence intervals for hazard rates in the survival literature (this routine is calculating a hazard rate, more or less), these all assume iid obervations -- and that is definitely not the case here
Hi,
I have a .trees file with 37 individuals from 4 populations generated in SLiM, which I have recapitated, simplified, and added mutations to. Now, I want to calculate the pairwise coalescent rate both between and within populations, and obtain the mean and variance for each population pair.
Here is a simplified version of my current code:
In the loop, I iterate over each tree and individual in the respective population(s), compute the TMRCA using tree.tmrca(ind1, ind2), and weight the values by tree span. Then, I calculate the inverse of the weighted TMRCAs and scale by 2Ne (since the individuals are diploid).
My question is: is there a more efficient way to achieve this, or is there a built-in function to extract pairwise coalescent rate values directly from .trees files? Or is there a function to extract coalescent rates in general? Because I couldn't find any of this for more than one diploid individual.