tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

Coalescent rate trajectory ignores a population's activation time #2289

Open apragsdale opened 2 months ago

apragsdale commented 2 months ago

A population can have an activation time that is greater than zero, for example to sample ancient individuals. When lineages from such a population are passed to the demography debugger's coalescence_rate_trajectory() function, they are allowed to coalesce at times more recent than the population's activation time, that is, before the population is active. As a result, the reported probability that the lineages are still uncoalesced can already be less than 1 at the sampling time of that population.

For a demography built from demes, the population size is set to zero at times more recent than activation, so the trajectory appears to fall back on the minimum population size, which leads to unreasonable outputs. There also seems to be some strange behavior with cross-population coalescences that I haven't gotten to the bottom of yet.

A short example:

import demes
import msprime
import numpy as np

b = demes.Builder()
b.add_deme("A", epochs=[dict(start_size=10000)])
b.add_deme("B", ancestors=["A"], start_time=2000, epochs=[dict(start_size=1000, end_time=100)])
g = b.resolve()

demog = msprime.Demography.from_demes(g)
db = demog.debug()

steps = np.linspace(100, 2100, 201)
coalrates, puncoal = db.coalescence_rate_trajectory(steps, lineages={"B": 2})

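print(puncoal[0])  # should be 1: B only becomes active at time 100, so nothing can coalesce before then
print(np.exp(-0.5 * 100))  # the value actually returned: prob. of no coalescence over 100 generations with the minimum population size of 1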
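
For reference, the value on the last line follows from the standard pairwise coalescent: two lineages in a diploid population of constant size N coalesce at rate 1/(2N) per generation, so the probability that they are still distinct after t generations is exp(-t / (2N)). A minimal sketch of that arithmetic, assuming a constant size over the whole interval (the helper name `prob_uncoalesced` is made up for illustration and is not part of msprime):

```python
import math


def prob_uncoalesced(t, pop_size):
    """Probability that two lineages have not coalesced after t generations
    in a diploid population of constant size pop_size (hypothetical helper)."""
    rate = 1.0 / (2.0 * pop_size)  # pairwise coalescence rate per generation
    return math.exp(-rate * t)


# B has size zero for the 100 generations before it becomes active, so the
# rate is computed with the minimum population size of 1 instead.
p_bug = prob_uncoalesced(100, 1)
print(p_bug)  # effectively zero, whereas the expected value at time 100 is 1
```

With a population size of 1 the rate is 0.5 per generation, which reproduces the `np.exp(-0.5 * 100)` term in the example above.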

This also shows some strange outputs when asking for cross-coalescence rates:

db.coalescence_rate_trajectory(np.linspace(0, 200, 201), {"A": 1, "B": 1})

apragsdale commented 2 months ago

I meant to say, I'm happy to dig into this and open a PR with a fix. I think it will be pretty simple.

petrelharp commented 2 months ago

> I meant to say, I'm happy to dig into this and open a PR with a fix. I think it will be pretty simple.

Awesome! A nice way to deal with this would be to also allow a time to be specified for the sampled lineages, perhaps by accepting a SampleSet?