tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Clarification on pairwise coalescent rates #2986

Closed Luker121 closed 1 month ago

Luker121 commented 1 month ago

Hi,

I have a .trees file with 37 individuals from 4 populations generated in SLiM, which I have recapitated, simplified, and added mutations to. Now, I want to calculate the pairwise coalescent rate both between and within populations, and obtain the mean and variance for each population pair.

Here is a simplified version of my current code:

for tree in tree_seq.trees():
        tree_start, tree_end = tree.interval
        overlap_span = min(tree_end, window_end) - max(tree_start, window_start)
        if overlap_span <= 0:
            continue

        for ind1 in pop1:
            for ind2 in pop2:
                if ind1 != ind2 or pop1 != pop2:
                    tmrca = tree.tmrca(ind1, ind2)
                    if tmrca > 0:
                        pairwise_tmrcas.append(tmrca)
                        weights.append(overlap_span)
                        total_weights += overlap_span

    if pairwise_tmrcas and total_weights > 0:
        weighted_mean_tmrca = np.sum(np.array(pairwise_tmrcas) * np.array(weights)) / np.sum(weights)
        pairwise_rate = 1 / weighted_mean_tmrca if weighted_mean_tmrca > 0 else np.nan

        pairwise_rates = [1 / tmrca for tmrca in pairwise_tmrcas if tmrca > 0]  # Avoid division by zero
        variance_rate = np.var(pairwise_rates)
        q5_rate = np.percentile(pairwise_rates, 5)
        q95_rate = np.percentile(pairwise_rates, 95)

        return {
            'mean': pairwise_rate,
            'variance': variance_rate,
            'q5': q5_rate,
            'q95': q95_rate
        }

In the loop, I iterate over each tree and individual in the respective population(s), compute the TMRCA using tree.tmrca(ind1, ind2), and weight the values by tree span. Then, I calculate the inverse of the weighted TMRCAs and scale by 2Ne (since the individuals are diploid).

My question is: is there a more efficient way to achieve this, or is there a built-in function to extract pairwise coalescent rate values directly from .trees files? Or is there a function to extract coalescent rates in general? Because I couldn't find any of this for more than one diploid individual.

petrelharp commented 1 month ago

Hello! Good question; the short answer is: it's a work in (active) progress; see for instance https://github.com/tskit-dev/tskit/pull/2985 .

However, what you're calculating is (I think) somewhat different - for instance, I don't think that the mean and variance are the mean and variance of the same distribution - not for instance, that mean depends on span and variance does not. The quantiles are, I think, quantiles of the distribution that's something like "pick a random tree and a random pair of individuals and find their inverse TMRCA"; this is different than "pick a random location on the genome and do the same thing". I have not thought through this in detail, but I wouldn't have called those inverse TMRCAs "rates" in this case? However, I may well be missing something.

I'll close this, but re-open if you like.

Luker121 commented 1 month ago

Hi, thanks for your fast reply. I appreciate the explanation on the distributions for the mean and variance. Here’s what I would do to get the weighted mean and weighted variance of my "inverse TMRCA":

For example:

# Span-weighted mean
mean_tmrca = np.sum(np.array(pairwise_tmrcas) * np.array(weights)) / np.sum(weights)

# Span-weighted variance
variance_tmrca = np.sum(weights * (np.array(pairwise_tmrcas) - mean_tmrca)**2) / np.sum(weights)

This way, the variance is based on the same span-weighted distribution as the mean. But this is not really what is meant with pairwise coalescent rates, right? It would rather be an inverse pairwise TMRCA as I am inverting coalescent times rather than calculating coalescent rates? In any case I am very much looking forward to a build-in function to extract (pairwise) coalescent rates from .trees files. :)

nspope commented 1 month ago

It looks like you're estimating (negative) moments of the pair coalescence time distribution, not coalescence rates. In either case, the way to do it more efficiently is to consider a weighted sum over nodes (where the weights are proportional to the average number of pairs coalescing at the node) rather than a weighted sum over pairs of samples per tree.

If it is actually pair coalescence rates you're after, these are pretty easily calculated from the empirical CDF (cumulative sum of weights with increasing node time). As @petrelharp says, these are in progress and should make it into the development version of tskit pretty shortly. If you wanted to work them out yourself (using node weights calculated in code snippet below), there's a description here.

However assuming that you're actually after negative moments, it'd be good to step back and ask if these actually exist. For example, let's say we're in an equilibrium population with haploid size $N$. Then under the coalescent model the distribution of time to coalescence $t$ of a pair is exponential with rate $1 / N$. In this case, $\mathbb{E}[t^{-1}] = \mathbb{E}[t^{-2}] = \infty$ (e.g. look at moments of an inverse gamma). So it's not clear to me that your weighted sum will converge to a meaningful expectation no matter how long the sequence.

But, let's say you wanted to calculate moments of a functional, that does exist (for example $\mathbb{E}[\log t]$, $\mathbb{V}[\log t]$, etc. In this case, you could calculate the node weights efficiently with the development version of tskit (installed from github),

import msprime
import numpy as np

ts = msprime.sim_ancestry(1000, population_size=1e4, recombination_rate=1e-8, sequence_length=5e7, random_seed=1)

# there is also a "windows" argument to this method
weights = ts.pair_coalescence_counts(pair_normalise=True, span_normalise=True, time_windows="nodes")
values = ts.nodes_time

# assume you want to estimate E[log(t)], V[log(t)] etc
weights = weights[values > 0]
values = np.log(values[values > 0])
mean = np.sum(weights * values) / np.sum(weights)
var = np.sum(weights * (values - mean) ** 2) / np.sum(weights)

As a sanity check, comparing against simulated iid pairs:

ts_pair = msprime.sim_ancestry(1, population_size=1e4, recombination_rate=0, sequence_length=1, num_replicates=10000, random_seed=2)
iid_values = np.array([ts_rep.nodes_time.max() for ts_rep in ts_pair])
mean_iid = np.mean(np.log(iid_values))
var_iid = np.var(np.log(iid_values), ddof=0)
print(f"iid mean {mean_iid}; full ts mean {mean}") 
print(f"iid var {var_iid}; full ts var {var}")
# iid mean 9.302555147007228; full ts mean 9.327968582745994
# iid var 1.67762330135097; full ts var 1.6448586251316892
nspope commented 1 month ago

fwiw there's now a ts.pair_coalescence_rates(time_windows, ...) method in the development version of tskit.

Luker121 commented 1 month ago

I'm encountering a segmentation fault while using the new pair_coalescence_rates function in the development version of tskit. The error occurs when I run my script to calculate pairwise coalescent rates over windows in a tree sequence. The error output suggests a bug in lib/tskit/trees.c at line 9800.

Environment

tskit version: `0.5.8.dev0`
Python version: `3.9`
Operating System: Fedora Linux 40

Error message:

Bug detected in lib/tskit/trees.c at line 9800. If you are using tskit directly please open an issue on GitHub, ideally with a reproducible example.

I have a SLiM-generated tree sequence .trees file. I'm using the pair_coalescence_rates function with the following parameters:

time_windows=np.array([0, float('inf')])
sample_sets: 4 populations
indexes: itertools.combinations_with_replacement(range(len(populations)), 2)
windows: sequence windows of size 200kb

The error occurs during the second window of analysis (200,000–400,000).

Here is a simplified version of my code to reproduce the error:

import tskit
import numpy as np
import itertools

# Load tree sequence
tree_seq = tskit.load("example.trees")

# Define populations (4 populations)
populations = [
    list(range(0, 10)), 
    list(range(10, 25)),
    list(range(25, 31)),
    list(range(31, 37))
]

# Define indexes for pairwise coalescence
indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))

# Calculate coalescent rates
windows = [0, 200000, 400000]
pairwise_coal_rates = tree_seq.pair_coalescence_rates(
    time_windows=np.array([0, float('inf')]),
    sample_sets=populations,
    indexes=indexes,
    windows=windows
)

I've tried running on different tree sequence files, but the error persists.

petrelharp commented 1 month ago

Ah ha, thanks - could you provide the example.trees file or the SLiM script that produces it?

Luker121 commented 1 month ago

ok, thanks a lot for looking into this. I have a .trees file that spans a 2Mb region. And I am trying to go along that 2Mb region in 200kb windows and calculate the pairwise coalescent rate. It stops when trying to estimate the coalescent rate after the first window (0-200kb). Here is a more thorough version of my script:


import tskit
import numpy as np
import itertools

tree_seq = tskit.load("example.trees")

populations = [
    list(range(0, 10)),  # Population 1
    list(range(10, 25)),  # Population 2
    list(range(25, 31)),  # Population 3
    list(range(31, 37))   # Population 4
]

# sample set indexes for pairwise coalescence
indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))

# windows along the genome
window_starts = np.arange(0, tree_seq.sequence_length, 200000)
window_ends = np.minimum(window_starts + 200000, tree_seq.sequence_length)
windows = np.column_stack((window_starts, window_ends))

# calculate pairwise coalescence rates and quantiles in windows
for window_index, (start, end) in enumerate(windows):
    print(f"\nWindow {window_index + 1}: Start = {start}, End = {end}")

    try:

        pairwise_coal_rates = tree_seq.pair_coalescence_rates(
            time_windows=np.array([0, 5000, 500000, float('inf')]), 
            sample_sets=populations,
            indexes=indexes,
            windows=[start, end]
        )

        quantiles = tree_seq.pair_coalescence_quantiles(
            [0.05, 0.95], 
            sample_sets=populations,
            indexes=indexes,
            windows=[start, end]
        )

        print("Coalescent rates:", pairwise_coal_rates)
        print("Quantiles:", quantiles)

    except Exception as e:
        print(f"Error in window {start}-{end}: {e}")

I attached the .trees file as .zip file. example_tree.zip

jeromekelleher commented 1 month ago

Should we open an issue to track this specific bug and make sure we catch it before release?

petrelharp commented 1 month ago

I'm leaving that to @nspope - maybe he's got this already dealt with?

nspope commented 1 month ago

Great, thanks for trying it out and reporting this @Luker121. The issue is that the windows argument needs to be an increasing list of windows going from 0 to the sequence length-- just like with the other stats, you can't do isolated windows (for the moment anyway). For example, this works:

#!/usr/bin/env bash

import tskit
import numpy as np
import itertools

tree_seq = tskit.load("example.trees")

populations = [
    list(range(0, 10)),  # Population 1
    list(range(10, 25)),  # Population 2
    list(range(25, 31)),  # Population 3
    list(range(31, 37)),   # Population 4
]

# sample set indexes for pairwise coalescence
indexes = list(itertools.combinations_with_replacement(range(len(populations)), 2))

# windows along the genome
window_breaks = np.arange(0, tree_seq.sequence_length, 200000)

# calculate pairwise coalescence rates and quantiles in windows
pairwise_coal_rates = tree_seq.pair_coalescence_rates(
    time_windows=np.array([0, 5000, 500000, float('inf')]), 
    sample_sets=populations,
    indexes=indexes,
    windows=window_breaks,
)

quantiles = tree_seq.pair_coalescence_quantiles(
    [0.05, 0.95], 
    sample_sets=populations,
    indexes=indexes,
    windows=window_breaks,
)

in the outputs, pairwise_coal_rates[i] would give you the output for the ith window.

It should definitely be erroring out with an informative message when you give it isolated windows, not segfaulting, so that's a bug!

Luker121 commented 1 month ago

Thanks a lot for your help, that’s working now. I have just one last question: Is there a way of getting some kind of CI or variance for the pairwise coalescent rate per time window?

nspope commented 1 month ago

Not a straightforward way at the moment, no. At the sequence level (or for a large enough window) you could block-bootstrap, which basically involves subdividing the genomic interval into blocks; using ts.pair_coalescence_counts to get weights per time interval per block; then resampling blocks, summing weights per time window over blocks, and using the formulas in the docstring for ts.pair_coaelscence_rates to get a bootstrap replicate for the weights. But I'm not sure how well that'd work with smaller windows.

While there are certainly ways to get confidence intervals for hazard rates in the survival literature (this routine is calculating a hazard rate, more or less), these all assume iid obervations -- and that is definitely not the case here