tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
173 stars 85 forks source link

How to very quickly record marginal changes in tree height along a sequence? #2278

Closed trevorcousins closed 5 months ago

trevorcousins commented 5 months ago

If I am interested in recording a single locus' coalescent time distribution, it is easy to do lots of independent tree replicates, e.g:

for i in range(0,replicates):
    sim = msprime.sim_ancestry( # simulate ancestry
        samples=[msprime.SampleSet(num_samples=10 ploidy=1,population='human,time=0), \
                demography=demography,sequence_length=1)
    tree = sim.at(0)

However, I am interested in the distribution of tree heights at locus i+1 given locus i. In other words conditional on the current tree height, what is the distribution of the next tree height? Previously I recorded this by doing something like

L=1e+07 # arbitrary
r=1e-06 # arbitrary
sim = msprime.sim_ancestry( # simulate ancestry
    samples=[msprime.SampleSet(num_samples=10 ploidy=1,population='human,time=0), \
            demography=demography,sequence_length=L,recombination_rate=r)

then iterate through for tree in sim.trees() and record the height for each. It feels like there should be a faster way to do this, as the information must be built into what happens to a lineage after a recombination event. Even in my current method, it's not obvious what is the optimal L/r tradeoff to get the most transitions per unit time.

Any tips much appreciated!

Thanks,

Trevor

GertjanBisschop commented 5 months ago

Hi @trevorcousins,

the tree sequence format does not cache any information on the roots of the marginal trees. So one cannot simply iterate over all roots and ask for their node time (= tree height). That does mean indeed that sometimes you build the next tree while iterating just to find that the root has not changed. This should not be too big of an issue I think when the number of samples is not huge.

root_times = np.zeros(ts.num_trees)
for i, tree in enumerate(ts.trees()):
    root_times[i] = tree.time(tree.root)
trevorcousins commented 5 months ago

Thanks for the response Gertjan, though that's not really what I am asking. I'm wondering what's the quickest way to generate lots of marginal trees? Is it optimal to set e.g. L large and r low, or r large and L low? Or does it not matter?

GertjanBisschop commented 5 months ago

Aha, got it, sorry completely misunderstood your question. I think the quadratic scaling with LrNe in terms of runtime of the algorithm should not be affected by the relative contribution of L or r I think, that is when setting discrete_genome=False to make sure that each recombination event can hit a unique site. Couldn't you just run a whole bunch of simulations conditioning on there being at least one recombination and look at the height change going from the first tree to the next. This way you could parallelize things. You probably want to consider your choice of random seeds though when running a lot of replicates (see docs).