tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
170 stars 84 forks source link

Demography.from_demes(): ancestral population sizes and migration rates missing #2255

Open simonharnqvist opened 5 months ago

simonharnqvist commented 5 months ago

Hi,

I'm having issues converting my Demes graphs to msprime Demography objects; both population sizes and migration rates are missing unexpectedly. My Demes graphs match what I think I'm trying to specify, so it looks to me like a bug in msprime.Demography.from_demes() rather than in Demes.

I first encountered this problem with msprime 1.2.0, and upgraded to 1.3.0 in a fresh environment, with the exact same behaviour. The versions of key packages for this example are:

demes                     0.2.3              pyhd8ed1ab_0    conda-forge
demesdraw                 0.4.0              pyhd8ed1ab_0    conda-forge
msprime                   1.3.0           py312h0767fef_0    conda-forge
python                    3.12.1          hdf0ec26_1_cpython    conda-forge
tskit                     0.5.6           py312hf635c46_0    conda-forge

Missing population sizes

I've specified a Demes graph with three current demes (a, b, c) and their ancestors (ab_anc and abc_anc):

b = demes.Builder(time_units="generations")
b.add_deme("abc_anc", epochs=[{"start_size":1e6, "end_time":1e5}])
b.add_deme("ab_anc", ancestors=["abc_anc"], epochs=[{"start_size":1e6,  "end_time":1e4}])
b.add_deme("a", ancestors=["ab_anc"],  epochs=[{"start_size":1e6}])
b.add_deme("b", ancestors=["ab_anc"],  epochs=[{"start_size":1e6}])
b.add_deme("c", ancestors=["abc_anc"], epochs=[{"start_size":1e6}])
graph = b.resolve()

The graph looks like what I was trying to create: image

And indeed population sizes are what we specified:

[(pop_idx, pop.epochs[0].start_size) for pop_idx, pop in enumerate(graph.demes)]
[(0, 1000000.0),
 (1, 1000000.0),
 (2, 1000000.0),
 (3, 1000000.0),
 (4, 1000000.0)]

But when converting to an msprime Demography, population sizes are not preserved for the ancestral populations:

msprime.Demography.from_demes(graph)

image

This is not just a table formatting error either:

[Population(initial_size=0, growth_rate=0, name='abc_anc', description='', extra_metadata={}, default_sampling_time=100000.0, initially_active=False, id=0),
 Population(initial_size=0, growth_rate=0, name='ab_anc', description='', extra_metadata={}, default_sampling_time=10000.0, initially_active=False, id=1),
 Population(initial_size=1000000.0, growth_rate=0, name='a', description='', extra_metadata={}, default_sampling_time=0.0, initially_active=True, id=2),
 Population(initial_size=1000000.0, growth_rate=0, name='b', description='', extra_metadata={}, default_sampling_time=0.0, initially_active=True, id=3),
 Population(initial_size=1000000.0, growth_rate=0, name='c', description='', extra_metadata={}, default_sampling_time=0.0, initially_active=True, id=4)]

I tried also specifying "end_size"...

b = demes.Builder(time_units="generations")
b.add_deme("abc_anc", epochs=[{"start_size":1e6, "end_time":1e5}])
b.add_deme("ab_anc", ancestors=["abc_anc"], epochs=[{"start_size":1e6,  "end_time":1e4}])
b.add_deme("a", ancestors=["ab_anc"],  epochs=[{"start_size":1e6, "end_size":1e6}])
b.add_deme("b", ancestors=["ab_anc"],  epochs=[{"start_size":1e6, "end_size":1e6}])
b.add_deme("c", ancestors=["abc_anc"], epochs=[{"start_size":1e6, "end_size":1e6}])

graph2 = b.resolve()

...but that made no difference.

Missing migration rates Adding some migration to the demes specification:

b = demes.Builder(time_units="generations")
b.add_deme("abc_anc", epochs=[{"start_size":1e6, "end_time":1e5}])
b.add_deme("ab_anc", ancestors=["abc_anc"], epochs=[{"start_size":1e6,  "end_time":1e4}])
b.add_deme("a", ancestors=["ab_anc"],  epochs=[{"start_size":1e6}])
b.add_deme("b", ancestors=["ab_anc"],  epochs=[{"start_size":1e6}])
b.add_deme("c", ancestors=["abc_anc"], epochs=[{"start_size":1e6}])
b.add_migration(rate=1e-5, source="a", dest="b")
b.add_migration(rate=1e-5, source="b", dest="a")
b.add_migration(rate=1e-5, source="c", dest="ab_anc")

graph3 = b.resolve()

demesdraw.tubes(graph3)

Again, demes builds the graph that I intended to specify: image

But the c -> ab_anc migration rate is missing: image

apragsdale commented 4 months ago

Hi @simonharnqvist -- thanks for opening the issue. I believe everything is working as expected, and those missing population sizes and migration rates will appear as the populations come into existence.

If we specify demog = msprime.Demography.from_demes(graph3), and then look at demog.events, we should see

[PopulationParametersChange(time=10000.0, initial_size=1000000.0, growth_rate=None, population='ab_anc'),
 PopulationSplit(time=10000.0, derived=['a', 'b'], ancestral='ab_anc'),
 MigrationRateChange(time=10000.0, rate=0, source='b', dest='a'),
 MigrationRateChange(time=10000.0, rate=0, source='a', dest='b'),
 MigrationRateChange(time=10000.0, rate=1e-05, source='ab_anc', dest='c'),
 PopulationParametersChange(time=100000.0, initial_size=1000000.0, growth_rate=None, population='abc_anc'),
 PopulationSplit(time=100000.0, derived=['c', 'ab_anc'], ancestral='abc_anc'),
 MigrationRateChange(time=100000.0, rate=0, source='ab_anc', dest='c')]

For example, population ab_anc is initialized at time zero with population size 0, since it won't exist until the split event involving populations a and b. At that time (10000), the population size is set the split event takes place. The appropriate migration rates are also turned on and off in the next few entries in the demog.events list (noting that the meaning of "source" and "dest" are reversed, since msprime looks backwards in time).

Hope this helps explain the behavior, but please let us know if there seem to still be issues or something else could be clarified.

jeromekelleher commented 4 months ago

I think @apragsdale is correct here, and it's just a quirk of how msprime represents these models. The 'debug()' method should resolve these problems - hopefully this section of the docs will clarify: https://tskit.dev/msprime/docs/stable/demography.html#debugging-tools

Does that help?

simonharnqvist commented 4 months ago

Thanks @apragsdale and @jeromekelleher for getting back to me so quickly. You are (of course) right that the population sizes aren't lost (and clearly I could have checked for that very easily - thanks for your patience).

Unless there is a logic to why models specified directly as a Demography object are displayed differently to those generated from Demes, can I suggest that the representations are standardised at some point to both show the full Demography table with all the parameter values? I find it very helpful to have the whole model represented, as it is when creating an msprime-native Demography.

That's an enhancement suggestion, however - so I'll close this bug report now. Thanks both!

jeromekelleher commented 4 months ago

What was your resolution here @simonharnqvist - does the debug() output show what you expect? I'm not sure there's a difference between demes and non-demes models, I think it's probably just how msprime does things generally. Maybe this is something we need to be clearer about in the docs?

simonharnqvist commented 4 months ago

@jeromekelleher - both .events and .debug() shows that msprime indeed creates the model I want

I'm not sure I've understood your point correctly here - this is not the behaviour if I specify a model directly in msprime. Let's create the first model without migration directly as an msprime Demography:

msprime_demography = msprime.Demography()
msprime_demography.add_population(initial_size=1e6, name="abc_anc")
msprime_demography.add_population(initial_size=1e6, name="ab_anc")
msprime_demography.add_population(initial_size=1e6, name="a")
msprime_demography.add_population(initial_size=1e6, name="b")
msprime_demography.add_population(initial_size=1e6, name="c")
msprime_demography.add_population_split(time=1e6, ancestral="abc_anc", derived=["ab_anc", "c"])
msprime_demography.add_population_split(time=1e5, ancestral="ab_anc", derived=["a", "b"])
msprime_demography.sort_events()

Here the representation of the Demography object includes all the population sizes (which IMO is far less confusing): image

If I convert this to a Demes graph and then back to a Demography...

demesgraph = msprime_demography.to_demes()
msprime.Demography.from_demes(demesgraph)

... the representation now loses its population sizes: image

If this is intended or expected behaviour, then explaning this in the docs would be very helpful - and perhaps clarifying to users whether this would have any effects on downstream simulations.

jeromekelleher commented 4 months ago

Ah thanks - that is confusing! I suspect there may be some slight semantic differences here in the Demes model vs msprime. Let's keep this open to see if we can do any better.