tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
176 stars 87 forks source link

Declarative demography #1058

Closed jeromekelleher closed 4 years ago

jeromekelleher commented 4 years ago

After a discussion with @grahamgower this morning, we decided it would be nice to have a declarative structure for demography, where we describe what our populations are and how they relate to each other. Here's a first pass at doing the OOA model using a toml description:

Warning: the parameters are WRONG - DO NOT COPY THEM!!

description = "Gutenkunst et al three population Out-of-Africa"
generation_time = 25
time_units = "kya"

[populations]
    [populations.ancestral_human]
        size = 7300
        time = 220

    [populations.ancestral_african]
        ancestor = "ancestral_human"
        population_size = 12300
        time = 140

    [populations.ancestral_eurasian]
        ancestor = "ancestral_africans"
        size = 12300
        time = 21.2

    [populations.YRI]
        ancestor = "ancestral_africans"
        size = 12300

    [populations.CEU]
        ancestor = "ancestral_eurasian"
        size = 1000
        growth_rate = 0.0055

    [populations.CHB]
        ancestor = "ancestral_eurasian"
        size = 510
        growth_rate = 0.004

# The limitations of toml here become apparent when we try to 
# declare some migration relationships between these populations.
[migration]
    symmetric = [
        # Annoyingly, the rate must be enclosed in a list for toml
        [["ancestral_african", "ancestral_eurasian"], [25e-5]],
        [["YRI", "CEU"], [3e-5]],
        [["YRI", "CHB"], [1.9e-5]],
        [["CEU", "CHB"], [9.6e-5]],
    ]
    # Could have asymmetric migration here also

    # Sticking in an admixture event, just to see if we can handle mass migration.
    admixture = [
        # (source, dest), (time, fraction). We need this lameness 
        # because toml won't accept mixed array types. This isn't 
        # great.
        [["YRI", "CEU"], [1.0, 0.1]]
    ]

Some parts of this are quite nice, and others are pretty nasty. I think toml is too restrictive and will have another go in a minute with yaml.

This is an inherently graphical description of the populations, inspired by @apragsdale's approach in the demography package. A "population" in this context is modelled as a unit in which the parameters don't change and no mergers with other populations occur. Each population has an ancestor attribute, which points to the population that it was derived from. There is also a time attribute, which (forwards in time) is the time at which this population goes extinct (if it hasn't split into other populations).

Migration is treated separately to this, and set up as relationships between the populations, outside the population inheritance tree. This ended up being ugly to express in toml, but it might be easier to do in yaml, say.

Any thoughts? Is this population inheritance tree a useful tool, or just overly simplistic?

grahamgower commented 4 years ago

The migration bits could be put in the population stanzas. Something more like this (ommitting the 'popluations' prefix for brevity):


# Anatomically modern humans
[AMH]
time = 70_000

# Ancestral out-of-Africa population
[OOA]
ancestor = "AMH"
time = 50_000

# Yoruba African
[YRI]
ancestor = "AMH"

    [[YRI.migration]]
    dest = "CEU"
    rate = 1e-4

    [[YRI.migration]]
    dest = "CHB"
    rate = 1e-4

# Western European
[CEU]
ancestor = "OOA"

    # constant migration rate from CEU -> CHB
    [[CEU.migration]]
    dest = "CHB"
    rate = 1e-5

    # pulse of back migration to Africa
    [[CEU.admixture]]
    time = 5_000
    dest = "YRI"
    rate = 0.002

# Han Chinese
[CHB]
ancestor = "OOA"

    # constant migration rate from CHB -> CEU
    [[CHB.migration]]
    dest = "CEU"
    rate = 1e-5

Convert to json:

import sys
import toml
import json

ooa = toml.load(sys.stdin)
json.dump(ooa, sys.stdout, indent=4)

Which looks like this:

{
    "AMH": {
        "time": 70000
    },
    "OOA": {
        "ancestor": "AMH",
        "time": 50000
    },
    "YRI": {
        "ancestor": "AMH",
        "migration": [
            {
                "dest": "CEU",
                "rate": 0.0001
            },
            {
                "dest": "CHB",
                "rate": 0.0001
            }
        ]
    },
    "CEU": {
        "ancestor": "OOA",
        "migration": [
            {
                "dest": "CHB",
                "rate": 1e-05
            }
        ],
        "admixture": [
            {
                "time": 5000,
                "dest": "YRI",
                "rate": 0.002
            }
        ]
    },
    "CHB": {
        "ancestor": "OOA",
        "migration": [
            {
                "dest": "CEU",
                "rate": 1e-05
            }
        ]
    }
}
jeromekelleher commented 4 years ago

That is neater all right. I started out thinking like this, but got a bit worried about putting the same migration rate in twice. Since symmetric migration between two populations is very common in these models, it seems error prone to force people to write it in twice, doesn't it?

grahamgower commented 4 years ago

Yeah, I suppose so. Another format option could be to use python's ast.literal_eval, which evaluates (nested) literal python data structures. The json code above can be directly parsed into a dictionary, for example. Compared with json however, comments starting with # are accepted, and dictionary keys need not be strings.

But maybe starting from the spec language is wrong. What do you think the ideal demography declaration would look like?

grahamgower commented 4 years ago

Below is a yaml example. We could use strictyaml to avoid common yaml problems. But a toml, yaml, etc. format means no user-defined variables, and maths functions aren't available for e.g. growth_rate = log(end_size / start_size) / time_delta. So after playing with some examples, I think I'd favour having a declarative-style Python API instead. Users are already familiar with Python, it's a very expressive language, and the example representations in this thread do map nicely to attrs data classes.

In any event, some things I'd take away from this exercise are:

description: This is a three-population example.
time_units: kya
generation_time: 0.001  # also in units of `time_units`.
# default size, unless changed by a population-specific field
initial_size: 1000  # implies the default growth_rate=0

populations:
  - A: { extinction_time: 20 }
  - B: { ancestor: A }
  - C:
      ancestor: A
      initial_size: 12
      growth_rate: -0.25
      population_size_change:
        - { time: 10, initial_size: 1000, growth_rate: 0 }
        - { time: 11, initial_size: 2000, growth_rate: 0 }
        - { time: 12, initial_size: 1000, growth_rate: 0 }
      migration_rate_change:
        - { time: 15, dest: B, rate: 0.03 }
        - { time: 16, dest: B, rate: 0.04 }

# symmetric migrations
migration:
  # active when B and C are both alive, unless overwritten elsewhere
  - { populations: [B, C], rate: 1e-3 }
  # active from the specified time (long form syntax)
  - time: 8
    populations:
      - B
      - C
    rate: 2e-3

Parse with strictyaml and convert to json:

import sys
import json

from strictyaml import (
    Optional, Map, MapPattern, Float, Seq, Str, dirty_load
)

schema = Map({
    "description": Str(),
    "time_units": Str(),
    Optional("generation_time"): Float(),
    Optional("initial_size"): Float(),
    Optional("growth_rate"): Float(),
    "populations": Seq(MapPattern(Str(), Map({
        Optional("ancestor"): Str(),
        Optional("extinction_time"): Float(),
        Optional("initial_size"): Float(),
        Optional("growth_rate"): Float(),
        Optional("population_size_change"): Seq(Map({
            "time": Float(),
            "initial_size": Float(),
            "growth_rate": Float(),
        })),
        Optional("migration_rate_change"): Seq(Map({
            "time": Float(),
            "dest": Str(),
            "rate": Float(),
        })),
    }))),
    Optional("migration"): Seq(Map({
        Optional("time"): Float(),
        "populations": Seq(Str()),
        "rate": Float(),
    })),
})

s = sys.stdin.read()
d = dirty_load(s, schema, allow_flow_style=True)
json.dump(d.data, sys.stdout, indent=4)

json equivalent:

{
    "description": "This is a three-population example.",
    "time_units": "kya",
    "generation_time": 0.001,
    "initial_size": 1000.0,
    "populations": [
        {
            "A": {
                "extinction_time": 20.0
            }
        },
        {
            "B": {
                "ancestor": "A"
            }
        },
        {
            "C": {
                "ancestor": "A",
                "initial_size": 12.0,
                "growth_rate": -0.25,
                "population_size_change": [
                    {
                        "time": 10.0,
                        "initial_size": 1000.0,
                        "growth_rate": 0.0
                    },
                    {
                        "time": 11.0,
                        "initial_size": 2000.0,
                        "growth_rate": 0.0
                    },
                    {
                        "time": 12.0,
                        "initial_size": 1000.0,
                        "growth_rate": 0.0
                    }
                ],
                "migration_rate_change": [
                    {
                        "time": 15.0,
                        "dest": "B",
                        "rate": 0.03
                    },
                    {
                        "time": 16.0,
                        "dest": "B",
                        "rate": 0.04
                    }
                ]
            }
        }
    ],
    "migration": [
        {
            "populations": [
                "B",
                "C"
            ],
            "rate": 0.001
        },
        {
            "time": 8.0,
            "populations": [
                "B",
                "C"
            ],
            "rate": 0.002
        }
    ]
}
jeromekelleher commented 4 years ago

This is very neat, thanks @grahamgower! This is a little bit different to the graph inspired approach we talked about, where a "population" (as far as this description is concerned) had fixed parameters, and any instantaneous changes meant a new population. Maybe we could call it a "deme" or something as then we can keep the term Population as it is in msprime, currently. This seems nice to me as it means that there's no duplication of time values across different aspects of the demography.

Here's a YAML example:

---
description: This is a three-population Out of Africa example
  with parameters that are not correct.
time_units: kya
generation_time: 0.001  # also in units of `time_units`.

populations:
  - AMH:
    description: Anatomically modern humans
    start_time: 70
    initial_size: 7300

  - OOA:
    ancestor: AMH
    description: Out-of-Africa population
    start_time: 50
    initial_size: 12300

  - YRI:
    ancestor: AMH
    description: Yoruba African
    initial_size: 12300

  - CEU:
    ancestor: OOA
    description: Western European
    initial_size: 1000
    growth_rate: 0.0055

  - CHB:
    ancestor: OOA
    description: East Asian
    initial_size: 510
    growth_rate: 0.004

migrations:
  symmetric:
    - {populations: [YRI, OOA], rate: 25e-5}
    - {populations: [YRI, CEU], rate: 1.9e-5}
    - {populations: [CEU, CHB], rate: 9.6e-5}

  admixture:
    # pulse of back migration to Africa
    # NOTE source and dest are probably inverted from their usage in MassMigration.
    - {time: 5, source: CEU, dest: YRI, rate: 0.002}

This seems quite neat? There's still a bunch of things that are confusing (like, initial_size refers to the size at the extinction time). I've also left out the times here (extinction_time doesn't feel quite right - ancestrally modern humans didn't go extinct) also.

The one thing I'm not sure is a real problem or not is the two different migration rates between YRI <-> CEU and YRI<->OOA. I'm not sure we could figure this out correctly, which is why I think having a different population for ancestral Africans as well could be a good idea.

grahamgower commented 4 years ago

This is a little bit different to the graph inspired approach we talked about, where a "population" (as far as this description is concerned) had fixed parameters, and any instantaneous changes meant a new population.

For some reason, I had it in my mind that this was just referring just to msprime's integer population IDs for internal nodes of the demography, like AMH and OOA above. So you're suggesting that each population size change introduces a new population ID as well? So a zizag model would have a new ID for each size change?

Being able to define multiple demes for a given population (ID) would be a good way forward I think. It would formalise what's currently implicit. The parameters that would need to be fixed per deme are (initial_size, growth_rate, migration_rates_this_deme_is_source, migration_rates_this_deme_is_dest). Or maybe just one of the migration objects?

This seems quite neat? There's still a bunch of things that are confusing (like, initial_size refers to the size at the extinction time). I've also left out the times here (extinction_time doesn't feel quite right - ancestrally modern humans didn't go extinct) also.

That looks very reasonable. It might be best to keep backwards-time conventions and terminology here, otherwise more confusion could arise. So, start_time for a population's introduction, initial_size for the deme's size at start_time, and migration source/dest match MassMigration.

The one thing I'm not sure is a real problem or not is the two different migration rates between YRI <-> CEU and YRI<->OOA. I'm not sure we could figure this out correctly, which is why I think having a different population for ancestral Africans as well could be a good idea.

I'd be happy for this to implicitly create new demes under YRI. I think that behaviour is just as reasonable as requiring the user to specify the demes themselves. It just needs to be clearly documented.

jeromekelleher commented 4 years ago

For some reason, I had it in my mind that this was just referring just to msprime's integer population IDs for internal nodes of the demography, like AMH and OOA above. So you're suggesting that each population size change introduces a new population ID as well? So a zizag model would have a new ID for each size change?

Maybe - there's a bit of wiggle-room here and it's good to explore the ideas. I think we should concentrate here on what we think is the most expressive and error-free way of describing the population histories, and then figure out how to translate this pack to tskit/msprime populations later. I have no qualms about creating lots of populations - this is something msprime should be handling better anyway.

I'm keen to see what @apragsdale thinks of our ideas so far. Aaron, any thoughts?

apragsdale commented 4 years ago

I like where this is headed. When I've worked through some of these similar questions for building the networkx demography objects, if a population has a change in size function or migration rate, I found it easiest to just define those different epoch as separate populations and connect them with an edge. But it's kind of clunky, because you can have some parameters (say migration rates) that stay the same over those different epochs, while others change.

For the issue of migration between populations whose existence times do not overlap (thinking of the YRI/OOA/CEU/CHB case here), I don't know if you need to split the YRI into two populations. YRI could have migration rates to each OOA, CEU, and CHB, and since those populations each have their own interval of time that they exist, is should be straightforward to figure out how to handle the changes in the migration matrix at the time of the OOA->(CEU,CHB) split. Also, what if there is an additional population (say some other population in Africa that exists over the same interval as YRI) that exchanges migrants that whole time (call it AFR2)? If you split YRI, then you'd have to copy the same migration rate information to YRI0->AFR2 and YRI1->AFR2. Some migration rates may change, but others may not between the two epochs.

Not sure if that makes sense.. maybe it would be easiest to have a call to discuss together?

jeromekelleher commented 4 years ago

Not sure if that makes sense.. maybe it would be easiest to have a call to discuss together?

Good call, let's do it. I'll email to coordinate.

jeromekelleher commented 4 years ago

FYI: populations are now officially cheap, as of #1069. We should use as many as makes the job of modelling the demography as easy as possible.

apragsdale commented 4 years ago

Another thought about how to specify population growth/decay, in favor of specifying start size and end size over an epoch instead of initial size and growth rate: In discrete forward simulators, a growth rate of r means that N[t+1] = N[t] * (1+r), so N[t+g] = N[t] * (1+r) ** g, if g is the number of generations later. In a continuous model, like msprime, the size at generation t+g would be N[t+g] = N[t] * exp(r*g), and those sizes will be slightly different between a continuous vs discrete model. By specifying sizes at start and end of simulation instead, we'd leave it up to the simulation software to figure out the appropriate r.

Edit, extra thought: And you could imagine then more easily allowing flexibility to specify wheter the growth is exponential, linear, or whatever (although I suppose msprime only allows constant size or exponential size change).

jeromekelleher commented 4 years ago

Good points @apragsdale.

apragsdale commented 4 years ago

Playing around with some options for how to specify demographies (with four examples, taken from stdpopsim, below). A few thoughts:

These are of course just drafts, happy to discuss how it could be improved, and I'm not set on anything in particular. Writing out examples has helped clarify some possible snags of ways we discussed over the call..

Gutenkunst OOA

description: The Gutenkunst et al. (2009) three-population model of human history. Time
  is given in years in the past. Coices to make, 1) which ends of the epochs should
  start and end times describe, 2) once that's decided, initial and final sizes should
  correspond to start and end time, resp. Currently, I'm looking backward in time, so
  start time is more recent and end time is the epoch interval end point farther in the
  past.  
doi: 10.1371/journal.pgen.1000695
time_unit: years
generation_time: 25
mutation_rate: 2.35e-8

populations:
  - ancestral:
    description: Equilibrium/root population
    start_time: 220e3
    initial_size: 7300

  - AMH:
    description: Anatomically modern humans
    ancestors: ancestral
    start_time: 140e3
    end_time: 220e3
    initial_size: 12300

  - OOA:
    description: Bottleneck out-of-Africa population
    ancestors: AMH
    start_time: 21.2e3
    end_time: 140e3
    initial_size: 2100

  - YRI:
    description: Yoruba in Ibadan, Nigeria
    ancestors: AMH
    start_time: 0
    end_time: 140e3
    initial_size: 12300

  - CEU:
    description: Utah Residents (CEPH) with Northern and Western European Ancestry
    ancestors: OOA
    start_time: 0
    end_time: 21.2e3
    initial_size: 29725
    final_size: 1000

  - CHB
    description: Han Chinese in Beijing, China
    ancestors: OOA
    start_time: 0
    end_time: 21.2e3
    initial_size: 54090
    final_size: 510

migration:
  - symmetric:
    unit: per generation
    rates:
      - {YRI, OOA}: 25e-5
      - {YRI, CEU}: 3e-5
      - {YRI, CHB}: 1.9e-5
      - {CEU, CHB}: 9.6e-5

Zig zag:

description: A single population model with epochs of exponential growth and decay.
  There are a few ways we could define this demography, 1) each epoch is its own
  "population", or 2) populations have optional epochs, over which we define start and
  end sizes. I give both options below, and maybe we'd like to allow the flexibility to
  define models using either approach. Parameters taken from table in stdpopsim docs.
  Note: realizing there is a bit of an inconsistency where I'm listing epochs in
  opposite order from how I'm measuring time. However, we should be able to list
  populations in any order and not enforce something arbitrary.
doi: 10.1038/ng.3015
time_units: generations
generation_time: 29
generation_time_unit: years

##########
# option 1
##########

populations:
  - ancient
    description: Ancient population with constant size
    start_time: 34133
    initial_size: 1431

  - epoch_1
    description: First epoch, growth
    ancestor: ancient
    start_time: 8533.33
    end_time: 34133.31
    initial_size: 34133
    final_size: 1431

  - epoch_2
    description: Second epoch, decline
    ancestor: epoch_1
    start_time: 2133.33
    end_time: 8533.33
    initial_size: 1431
    final_size: 34133

  - epoch_3
    description: Third epoch, growth
    ancestor: epoch_2
    start_time: 533.33
    end_time: 2133.33
    initial_size: 34133
    final_size: 1431

  - epoch_4
    description: Fourth epoch, decline
    ancestor: epoch_3
    start_time: 133.33
    end_time: 533.33
    initial_size: 1431
    final_size: 34133

  - epoch_5
    description: Fifth epoch, growth
    ancestor: epoch_4
    start_time: 33.333
    end_time: 133.33
    initial_size: 34133
    final_size: 1431

  - epoch_6
    description: Final epoch, constant size
    ancestor: epoch_5
    start_time: 0
    end_time: 33.333
    initial_size: 34133

##########
# option 2
##########

populations: 
  - generic
    description: All epochs wrapped into the same population, so that epoch intervals
      do not overlap, and they tile the entire existence of the population (all time,
      in this case).
    start_time: 0
    epochs:
      - (0, 33.333):
        initial_size: 14312
      - (33.333, 133.33):
        initial_size: 14312
        final_size: 1431
      - (133.33, 533.33):
        initial_size: 1431
        final_size: 14312
      - (533.33, 2133.33):
        initial_size: 14312
        final_size: 1431
      - (2133.33, 8533.33):
        initial_size: 1431
        final_size: 14312
      - (8533.33, 34133.31):
        initial_size: 14312
        final_size: 1431
      - (34133.31, ):
        initial_size: 1431

Tennessen

description: The Tennessen et al. (2012) model of recent fast growth in African and
  European populations.
doi: 10.1126/science.1219240
time_unit: generations
generation_time: 25
generation_time_unit: years

populations:
  - ancestral:
    description: Equilibrium/root population
    start_time: 5920
    initial_size: 7310

  - AMH:
    description: Anatomically modern humans
    ancestors: ancestral
    start_time: 2040
    end_time: 5920
    initial_size: 14474

  - AFR:
    description: African population
    ancestors: AMH
    start_time: 0
    end_time: 2040
    epochs:
      - (0, 204.6):
        initial_size: 432125
        final_size: 14474
      - (204.6, 2040):
        initial_size: 14474

  - EUR:
    description: European population
    ancestors: AMH
    start_time: 0
    end_time: 2040
    epochs:
      - (0, 204.6):
        initial_size: 501436
        final_size: 9279
      - (204.6, 920):
        initial_size: 9279
        final_size: 1032
      - (920, 2040):
        initial_size: 1861

migration:
  - symmetric:
    unit: per generation
    rates:
      - {AFR, EUR}:
        epochs:
          - (0, 920): 2.5e-5
          - (920, 2040): 15e-5

Browning:

description: The Browning et al. (2011) model of admixture in the Americas.
doi: 10.1371/journal.pgen.1007385
time_unit: generations
generation_time: 25
generation_time_unit: years

populations:
  - ancestral:
    description: Equilibrium/root population
    start_time: 5920
    initial_size: 7310

  - AMH:
    description: Anatomically modern humans
    ancestors: ancestral
    start_time: 2040
    end_time: 5920
    initial_size: 14474

  - OOA:
    description: Bottleneck out-of-Africa population
    ancestors: AMH
    start_time: 920
    end_time: 2040
    initial_size: 1861

  - AFR:
    description: African population
    ancestors: AMH
    start_time: 0
    end_time: 2040
    initial_size: 14474

  - EUR:
    description: European population
    ancestors: OOA
    start_time: 0
    end_time: 920
    initial_size: 34039
    final_size: 1000

  - EAS:
    description: East Asian population
    ancestors: OOA
    start_time: 0
    end_time: 920
    initial_size: 45852
    final_size: 510

  - ADMIX:
    description: Admixed America
    ancestors: AFR, EUR, EAS
    ancestor_proportions: 1/6, 1/3, 1/2
    start_time: 0
    end_time: 12
    initial_size: 54664
    final_size: 30000

migration:
  - symmetric:
    unit: per generation
    rates:
      - {YRI, OOA}: 15e-5
      - {YRI, CEU}: 2.5e-5
      - {YRI, CHB}: 0.78e-5
      - {CEU, CHB}: 3.11e-5
grahamgower commented 4 years ago

Thanks Aaron, that's really useful. I like the idea of having migration epochs too.

Having a start_time and an end_time for each population/epoch makes things explicit, which I like. On the other hand, it also means there's lots of duplication. Perhaps we should be able to refer to attributes defined for other populations? In your last example, how about something like this:

  - AMH:
    description: Anatomically modern humans
    ancestors: ancestral
    start_time: 2040
    end_time: ancestral.start_time
    initial_size: 14474

There's lots of duplicated numbers in the ZigZag epochs, and avoiding that might need a different aproach.

I also note that in the Browning model you have an end_time for the ADMIX population which doesn't match the ancestors' start_times. So this creates three implicit internal nodes in the graph, right? I suspect this will break the desired declarative-demography <-> in-memory graph bijection.

jeromekelleher commented 4 years ago

This is really nice, thanks @apragsdale. It's great to have concrete examples.

Re start and end times, it seems reasonable that the end_time for a child population is equal to the start_time of the parent population (otherwise it's unclear what the parent population relationship means). So we shouldn't put those numbers in twice.

Similarly, for the epochs, we should write down just one number I think. Avoiding duplication of the numbers is a key goal for this representation.

apragsdale commented 4 years ago

Having a start_time and an end_time for each population/epoch makes things explicit, which I like. On the other hand, it also means there's lots of duplication. Perhaps we should be able to refer to attributes defined for other populations? In your last example, how about something like this:

  - AMH:
    description: Anatomically modern humans
    ancestors: ancestral
    start_time: 2040
    end_time: ancestral.start_time
    initial_size: 14474

I like that idea!

There's lots of duplicated numbers in the ZigZag epochs, and avoiding that might need a different aproach.

There is quite a bit of duplication of epoch interval times and sizes at those interval break points. We may be able to get away with implying that if an initial size of a new epoch isn't set, that it's equal to the final size of the previous epoch, although we want to keep the flexibility of allowing both to be explicitly provided to handle instantaneous size changes.

I also note that in the Browning model you have an end_time for the ADMIX population which doesn't match the ancestors' start_times. So this creates three implicit internal nodes in the graph, right? I suspect this will break the desired declarative-demography <-> in-memory graph bijection.

Re start and end times, it seems reasonable that the end_time for a child population is equal to the start_time of the parent population (otherwise it's unclear what the parent population relationship means). So we shouldn't put those numbers in twice.

Yeah, I've been going back and forth on this point. I think the alternative (requiring an ancestor's start time to be the equal to the derived population's end time), is a bit restrictive. I think we'd want to only require that the ancestor's time interval overlaps with the derived population's end time. The Browning model is a good example of why - we don't want to have to split the three source populations at t=12 just to handle that admixture event, since the parameters of those populations are unchanged otherwise.

Similarly, for the epochs, we should write down just one number I think. Avoiding duplication of the numbers is a key goal for this representation.

Yes, I see what you mean. And for epochs within a population, the start/end time equality is more obvious. I'll have a go at reconfiguring to avoid duplication in these examples.

apragsdale commented 4 years ago

Zig Zag option 3, reducing duplication of entered values. If an epoch's initial size is not given, it is assumed to be equal to the final size of the previous epoch.

##########
# option 3
# The ends of epoch intervals are implied by the beginning of the next interval. If the
# initial size of an epoch is not given, it's assumed to be equal to the final size in
# the previous epoch. Although... even though this means we do not duplicate values, I
# think it is more confusing to have those sizes implied instead of explicitly written.
# Both approaches (option 2 and option 3) should probably be allowed.
##########

populations: 
  - generic
    description: Sequential exponential increase and decrease of population size.
    start_time: 0
    epochs:
      - (0, ):
        initial_size: 14312
      - (33.333, ):
        final_size: 1431
      - (133.33, ):
        final_size: 14312
      - (533.33, ):
        final_size: 1431
      - (2133.33, ):
        final_size: 14312
      - (8533.33, ):
        final_size: 1431
      - (34133.31, ):
        initial_size: 1431

Or the same Browning model (populations) with Graham's suggestion:

populations:
  - ancestral:
    description: Equilibrium/root population
    start_time: 5920
    initial_size: 7310

  - AMH:
    description: Anatomically modern humans
    ancestors: ancestral
    start_time: 2040
    end_time: ancestral.start_time
    initial_size: 14474

  - OOA:
    description: Bottleneck out-of-Africa population
    ancestors: AMH
    start_time: 920
    end_time: AMH.start_time
    initial_size: 1861

  - AFR:
    description: African population
    ancestors: AMH
    start_time: 0
    end_time: AMH.start_time
    initial_size: 14474

  - EUR:
    description: European population
    ancestors: OOA
    start_time: 0
    end_time: OOA.start_time
    initial_size: 34039
    final_size: 1000

  - EAS:
    description: East Asian population
    ancestors: OOA
    start_time: 0
    end_time: OOA.start_time
    initial_size: 45852
    final_size: 510

  - ADMIX:
    description: Admixed America
    ancestors: AFR, EUR, EAS
    ancestor_proportions: 1/6, 1/3, 1/2
    start_time: 0
    end_time: 12
    initial_size: 54664
    final_size: 30000

In this case, we could maybe even drop the end_time = OOA.start_time, and say that if no end time is given, it's either infinite if it doesn't have an ancestor, or implicitly equal to the ancestor's start time. But then what if there are multiple ancestors that have different start times...? Lots of corner cases that we could imagine, and maybe being explicit at the expense of a little bit of duplication/conciseness is warranted.

jeromekelleher commented 4 years ago

This is looking very neat @apragsdale. Something I'm still not clear about though: what does it mean to be an ancestor of a population? Since the population and its ancestors can now coexist, I'm not sure I see much value in the concept. Really this is just modelling some sort of migration relationship, so perhaps it'd be better to be explicit about that?

The "parent" population made sense to me in a strict species-tree type scenario, but in the Browning model it just seems confusing.

apragsdale commented 4 years ago

This is looking very neat @apragsdale. Something I'm still not clear about though: what does it mean to be an ancestor of a population?

I guess source would be a better term. A source could be a direct ancestor, in which case they don't overlap, or perhaps a population that provides migrants to a new population (more of a "PopulationPeel" instead of a "PopulationSplit" scenario, for a fun banana theme).

The "parent" population made sense to me in a strict species-tree type scenario, but in the Browning model it just seems confusing.

I want to avoid having to split the source populations into pre- and post-admixture event in the Browning model. I think changing the ancestor attribute to source might make that more clear?

molpopgen commented 4 years ago

I quite like the zig-zag v3 with epochs. It seems more natural than assigning new population labels. For my part, I'd like to be able to map a population name to an integer (deme index) and rely on that being constant.

grahamgower commented 4 years ago

Here a python version that incorporates lots of ideas from this thread. https://gist.github.com/grahamgower/879f2a5790f8987284c47a1061fcf765

molpopgen commented 4 years ago
In [1]: import strictyaml                                                                                                                                                  

In [2]: with open("zigzag.yml") as f: 
   ...:     d = f.read() 
   ...:                                                                                                                                                                    

In [3]: d                                                                                                                                                                  
Out[3]: 'populations: \n  - generic:\n    description: Sequential exponential increase and decrease of population size.\n    start_time: 0\n    epochs:\n      - (0, ):\n        initial_size: 14312\n      - (33.333, ):\n        final_size: 1431\n      - (133.33, ):\n        final_size: 14312\n      - (533.33, ):\n        final_size: 1431\n      - (2133.33, ):\n        final_size: 14312\n      - (8533.33, ):\n        final_size: 1431\n      - (34133.31, ):\n        initial_size: 1431\n'

In [4]: y = strictyaml.load(d)                                                                                                                                             

In [5]: y                                                                                                                                                                  
Out[5]: YAML(OrderedDict([('populations', [OrderedDict([('generic', ''), ('description', 'Sequential exponential increase and decrease of population size.'), ('start_time', '0'), ('epochs', [OrderedDict([('(0, )', ''), ('initial_size', '14312')]), OrderedDict([('(33.333, )', ''), ('final_size', '1431')]), OrderedDict([('(133.33, )', ''), ('final_size', '14312')]), OrderedDict([('(533.33, )', ''), ('final_size', '1431')]), OrderedDict([('(2133.33, )', ''), ('final_size', '14312')]), OrderedDict([('(8533.33, )', ''), ('final_size', '1431')]), OrderedDict([('(34133.31, )', ''), ('initial_size', '1431')])])])])]))

In [8]: y.data                                                                                                                                                             
Out[8]: 
OrderedDict([('populations',
              [OrderedDict([('generic', ''),
                            ('description',
                             'Sequential exponential increase and decrease of population size.'),
                            ('start_time', '0'),
                            ('epochs',
                             [OrderedDict([('(0, )', ''),
                                           ('initial_size', '14312')]),
                              OrderedDict([('(33.333, )', ''),
                                           ('final_size', '1431')]),
                              OrderedDict([('(133.33, )', ''),
                                           ('final_size', '14312')]),
                              OrderedDict([('(533.33, )', ''),
                                           ('final_size', '1431')]),
                              OrderedDict([('(2133.33, )', ''),
                                           ('final_size', '14312')]),
                              OrderedDict([('(8533.33, )', ''),
                                           ('final_size', '1431')]),
                              OrderedDict([('(34133.31, )', ''),
                                           ('initial_size', '1431')])])])])])

Here's what the last version of zig-zag looks like once we get it into Python. We're pretty close to something workable, but it is still very "object-y", meaning that a good bit of logic is still required to parse through everything. Simplifying this further may be where a formalized yaml spec comes in?

molpopgen commented 4 years ago

@grahamgower's gist is pretty slick, too. That, or something like it, would make a natural "intermediate representation" between the YAML and the "real" msprime objects. It could also be the base of a user-facing ModelBuilder in msprime. (And in fwdpy11, too, for that matter.)

grahamgower commented 4 years ago

Thanks @molpopgen, that's absolutely the direction I want this to go. One of my objectives here was to explore how the implicit semantics we've discussed could be imported into attrs data classes. It certainly wouldn't be much work to make a strictyaml schema for loading into this, and I'm hoping it will translate easily to a variety of other demography descriptions (including an msprime.Demography). It is, of course, still subject to bike-shedding, so please fire away if you think there are bits that need changing! It's also not at all clear whether this even belongs in msprime.

molpopgen commented 4 years ago

To minimize the bike-shedding, it seems like a finite vocabulary would help, to minimize the scope of the spec. Chatting w/@apragsdale offline, we also need epoch-variable things like selfing and cloning rates for forward sims, and a way to handle floating-point values.

But once a v1.0 of the vocab is in place, we could probably work out the details of a spec in a fresh repo somewhere to practice and break things.

molpopgen commented 4 years ago

@grahamgower -- this is a fun thing to add to attr classes: https://github.com/molpopgen/fwdpy11/blob/a8b439885644ffeb48ae947ccb081eea01bee58b/fwdpy11/class_decorators.py#L120-L138 I came up with this a few weeks ago so that examples in the manual print out nicely. It'll make big models more readable.

grahamgower commented 4 years ago

The black pretty-printing is very cool. But it looks like black just renamed s/FileMode/Mode/ since the latest release, so I get the sense the API is not stable.

jeromekelleher commented 4 years ago

It's also not at all clear whether this even belongs in msprime.

I think it's becoming clear that it doesn't belong here, and there's enough interest and progress now to start thinking about shipping it out to its own repo.

First question is, where should it live? I think it would be good if it lived in the tskit-dev GitHub Org, but we could certainly argue for somewhere else.

Second question is, what is it called? My initial thought is demography, but that clashes with @apragsdale's package, and it's a bit too broad anyway. Ideas? What we want is a core shared vocabulary for describing demographic models that can be used by multiple projects, so it should have minimal dependencies and be liberally licensed.

grahamgower commented 4 years ago

A tskit-dev project seems appropriate. How about demes, or demescription?

grahamgower commented 4 years ago

Actually, deme seems to be available on pypi. Then we could call our yaml (or whatever) a "deme script", and the in-memory version a deme scription.

molpopgen commented 4 years ago

The black pretty-printing is very cool. But it looks like black just renamed s/FileMode/Mode/ since the latest release, so I get the sense the API is not stable.

black claims to be in perpetual beta, so not surprising. That name change can be handled by a try/except. I'll add that soon, along with tests.

jeromekelleher commented 4 years ago

Actually, deme seems to be available on pypi. Then we could call our yaml (or whatever) a "deme script", and the in-memory version a deme scription.

It's snappy. The only reservation I have is "deme" doesn't convey a sense of historical populations or demographic structure. Nothing is perfect though. Googling for "python deme" gives no real results, which is good.

molpopgen commented 4 years ago

demeography would be slick and maximally confusing. 🤔

apragsdale commented 4 years ago

I didn't pick demography as a name for my package with any forethought about this.. it was just the name of my local directory when I turned it into a repo. So if that's a name we want for it, I can stop domain squatting that package name (it's not on pypi or anything like that, so I'm not sure how much it would clash). Otherwise something short would be nice, like demes or deme

Graham - that looks like a super nice bit of python code. I'm still wrapping my head around how attr works, but that is a great start

grahamgower commented 4 years ago

I like short names too. But yes, I agree that deme, singular, is not quite getting to the heart of what this is about. Also, deme is nice to keep available as a variable name. Maybe the plural demes is close enough though?

Other possibilities: demenet, demeanc, demented.

jeromekelleher commented 4 years ago

I like demes actually... Snappy and descriptive, once we're concrete about what a deme is.

apragsdale commented 4 years ago

Picking this back up.. @grahamgower I looked through your demegraph code a bit closer, and I think that's a really nice start to how we can nicely define and check the validity of the demographic structure. Perhaps we're at a point now to decide on some big picture things, like where it will live (I think in its own repo makes the most sense) and what features we'd like it to be able to support. Should we start mapping out the next steps? Happy to have a call about it, if that would be easier to coordinate.

cc: @molpopgen @jeromekelleher

molpopgen commented 4 years ago

A call of some sort would be useful.

jeromekelleher commented 4 years ago

Yes, let's have a call. I'll email the interested parties to get an initial discussion going.

jeromekelleher commented 4 years ago

We've decided to move this into its own repo: https://github.com/grahamgower/demes

For the record, we'd like msprime to be able to consume a demography described in a way like this, but we've decided that it's out of scope for msprime to define the language and semantics of the descriptions themselves.