tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
177 stars 88 forks source link

`from_ts` doesn't check for infinite coalescence time #649

Open petrelharp opened 5 years ago

petrelharp commented 5 years ago

If there are different roots in different populations, then you can set things up so that simulate(..., from_ts=ts) has an infinite coalescence time (because of zero migration between populations), in a way that would be detected if we weren't using from_ts.

Here is a MWE:

import msprime
import random

# as in recapitate()
recomb = msprime.RecombinationMap(positions = [0.0, 1e3], 
                                  rates = [0.001, 0.0],
                                  num_loci = int(1e3))

ts = msprime.simulate(4, recombination_map=recomb,
                      random_seed=23, __tmp_max_time=0.1)

# get the roots so we can assign some to different populations
roots = []
for t in ts.trees():
    # verify that the trees have not coalesced
    print(t.draw(format='unicode'))
    roots.extend(t.roots)

roots = list(set(roots))

# assign roots to different populations
tables = ts.dump_tables()
tables.populations.add_row()
population = tables.nodes.population
population[:] = [random.sample([0,1], 1)[0] for _ in range(ts.num_nodes)]
assert(len(set(population[roots])) > 1)
tables.nodes.set_columns(flags=tables.nodes.flags, population=population,
        individual=tables.nodes.individual, time=tables.nodes.time,
        metadata=tables.nodes.metadata, 
        metadata_offset=tables.nodes.metadata_offset)
nts = tables.tree_sequence()

def do_recap(ts):
    population_configurations = [msprime.PopulationConfiguration() 
                                 for _ in range(ts.num_populations)]
    msprime.simulate(
                from_ts = ts,
                population_configurations = population_configurations,
                recombination_map = recomb,
                start_time = max(ts.tables.nodes.time))

rts = do_recap(ts)
rnts = do_recap(nts)

I will track this down.

jeromekelleher commented 5 years ago

I don't quite follow: what's the problem here @petrelharp?

FYI, I edited to comment to turn on python syntax highlighting.

petrelharp commented 5 years ago

Whoops, the issue is deeper: it seems that passing a recombination rate disables the "infinite coalescence time" check??

>>> population_configurations = [msprime.PopulationConfiguration(sample_size=1),
...                              msprime.PopulationConfiguration(sample_size=3)]
>>> msprime.simulate(population_configurations = population_configurations)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/.local/lib/python3.6/site-packages/msprime-0.6.2-py3.6-linux-x86_64.egg/msprime/simulations.py", line 422, in simulate
    sim, mutation_generator, 1, provenance_dict, __tmp_max_time))
  File "/home/peter/.local/lib/python3.6/site-packages/msprime-0.6.2-py3.6-linux-x86_64.egg/msprime/simulations.py", line 124, in _replicate_generator
    sim.run(max_time)
  File "/home/peter/.local/lib/python3.6/site-packages/msprime-0.6.2-py3.6-linux-x86_64.egg/msprime/simulations.py", line 677, in run
    self.ll_sim.run(max_time)
_msprime.LibraryError: Infinite waiting time until next simulation event.
>>> 
>>> msprime.simulate(population_configurations = population_configurations,
...             recombination_rate = 1.0)
jeromekelleher commented 5 years ago

There are situations where we can't tell if the waiting time is infinite, and you just have to kill the process. The reason we can't detect it here is because there is always another event possible in the two populations: recombination can split the single lineage into two or more pieces and common ancestor events will rejoin them. This is perfectly reasonable behaviour if there's an event at some point that'll bring the two populations together.

In principle we can detect this, but in practise I'm not sure it's worth the bother, as I can imagine the rules we come up with being triggered in unexpected ways in valid simulations. What do you think?

petrelharp commented 5 years ago

Ah, I see. The problem here is that lineages start in different connected components of the graph. That's something that would be pretty easy to check, but is really something people should be checking separately. I'm happy to do this in the future, but it's not top of the list.

petrelharp commented 5 years ago

Other people have run into this, in the context of recapitation, and I think that checking would be easy to do. We'd need to, if there is only one infinite epoch to simulate for:

  1. get the list of unique pops that lineages are currently in
  2. check if the submatrix of the migration matrix corresponding to those linages works (need to write down exactly how)