think about parameterised models

popsim-consortium / demes-python

Tools for describing and manipulating demographic models.

https://popsim-consortium.github.io/demes-docs/

ISC License

18 stars 6 forks source link

think about parameterised models #63

Closed grahamgower closed 3 years ago

grahamgower commented 3 years ago

Once the dust settles, we'll want to add more complexity. :P

Currently, demes deals with concrete demographic models, ie. those with fixed parameter values. It would be really neat if we could also support parameterised demographic models, where parameters have a distribution and optionally depend on other parameters (eg. for an upper or lower bound). The motivation here, is to draw values from the distribution for simulation, do parameter inference using the parameterised model as a prior, or even output an inferred posterior parameterised demographic model.

This would need at least: (1) additional syntax in the yaml file, (2) a dependency graph for the parameters, and (3) an internal representation of parameterised models, from which it is cheap to construct many concrete models. We should consider carefully the design of the current version of demes, to avoid unnecessarily making this harder in the future.

jeromekelleher commented 3 years ago

Probably a good idea to sketch out what this might look like at the yaml level at least, as that'll be the hardest thing to change later.

apragsdale commented 3 years ago

Yes, I'm very excited to work toward that feature! Agree that we'll need to think a bit about how it looks at the YAML level and how to encode dependencies (as an auxiliary file or within the YAML, e.g.).

In the meantime though, maybe we should put together a milestone/checklist for demes version 1.0, to have a target for a note/paper and present to stdpopsim to get folks on board. I really think it's quite close.

grahamgower commented 3 years ago

How about having an additional variables stanza in the yaml, and permitting a variable name anywhere a numerical value is currently accepted? I imagine that a variable needs to have various properties such as : description, distribution type, distribution parameters, upper bound, lower bound. And we'll likely need to accept expressions, or the variables become overly restrictive.

description: |
  Ancestral deme A suffers a bottleneck at time T_bottleneck, recovering 100 generations later.
  After the bottleneck, demes B and C split from A at time T_split.
  After the split, there is an admixture pulse from B into C at time T_pulse.
time_units: generations

variables:
  T_bottleneck:
    # Might want some shorthand for fixed variables (e.g. "T_bottleneck: 5000").
    description: start time for the bottleneck in deme A
    distribution:
      fixed: 5000
  T_split:
    description: time at which demes B and C split off from deme A
    distribution:
      # Distributions and parameters could be structured as nested fields.
      normal:
        mean: 2000
        stddev: 100
    # Specify upper and lower bounds, producing a truncated distribution.
    upper: T_bottleneck
    lower: 1000
  T_pulse:
    description: time of admixture pulse from deme B into deme C.
    distribution:
      # Distribution parameters could instead be a list (or we could permit both).
      uniform: [1, T_split]
    # T_pulse bounds are implied by uniform params; no need to set explicitly.

demes:
  A:
    initial_Size: 1000
    epochs:
      - start_time: T_bottleneck
        initial_size: 100
      # It seems likely we'll want to accept expressions wherever numerical
      # values are accepted, where expressions may also contain variables.
      - start_time: T_bottleneck + 100
        initial_size: 1000
  B:
    ancestors: A
    start_time: T_split
    initial_Size: 1000
  C:
    ancestors: A
    start_time: T_split
    initial_Size: 1000

pulses:
- source: B
  dest: C
  time: T_pulse
  proportion: 0.01

jeromekelleher commented 3 years ago

Feels like it'll be hard to get that to stick with yaml, and we'd end up embedding some sort of turing complete language into the description. Doing this in a semantically sensible way across languages would be really hard.

What if we considered each of the parameters to be an estimate, which by default are single point estimates. The model then is that the DemeGraph describes each parameter and its distribution. So, basically what you suggest except we embed the information about the distributions directly in line. I agree there's problems on how you deal with things like time periods that depend on the value of earlier times.

However, unless we go the whole hog and describe the model in something like Lua we can't have arithmetic expressions. There's just no way we can imagine being able to support that in a C implementation of the demes parser.

grahamgower commented 3 years ago

Feels like it'll be hard to get that to stick with yaml, and we'd end up embedding some sort of turing complete language into the description. Doing this in a semantically sensible way across languages would be really hard.

Yeah, supporting expressions does have problems, which is part of the reason I wanted to start thinking about it sooner rather than later (when it's much harder to change things). The expressions here will just become strings though, so this should be embeddable in yaml. I think.

What if we considered each of the parameters to be an estimate, which by default are single point estimates. The model then is that the DemeGraph describes each parameter and its distribution. So, basically what you suggest except we embed the information about the distributions directly in line. I agree there's problems on how you deal with things like time periods that depend on the value of earlier times.

I'm a little lost here. Any chance you can provide a small example? I suspect we really do need an approach that permits variables depending on other variables.

However, unless we go the whole hog and describe the model in something like Lua we can't have arithmetic expressions. There's just no way we can imagine being able to support that in a C implementation of the demes parser.

No, you're right. This definitely can't work in the yaml parser itself. But I really do think we want to have expressions somehow. Without this, the parameterised models are not worth having at all, because you can't even relate one variable to another by a constant. And once we permit simple expressions, then we almost certainly also want to permit a restricted set of functions like log and exp.

jeromekelleher commented 3 years ago

I'm a little lost here. Any chance you can provide a small example?

demes:
  A:
    initial_size: 1000 # single point estimate
  B:
    initial_size: # Or a distribution
        distribution:
            normal:
               mean: 2000
               stddev: 100

So, it's just what you said, except we embed the description of the distributions directly.

I'm very nervous about embedding any type of arithmetic or logic in here - all sorts of problems start popping up, which would take away the point of using yaml at all I think. This must be declarative and static, with simple and unambiguous semantics.

It seems to me that most of the need for expressions comes from time values - that time values later in the model are dependent on those chosen earlier. Would this be handled by allowing some way of stating time delta values rather than absolute times? Say, if a start_time value starts with "+" we consider it as an increment to the previous epoch's time? Like

demes:
  A:
    epochs:
      - start_time: Normal(2000, 100)   # Shorthand for value distribution above
        initial_size: 10
      - start_time: +100 # 100 units *after* the last epoch
        initial_size: 100

This level of complexity feels manageable to me. Although, I'd imagine the "+" syntax would make it easy to introduce bugs where someone means to write a relative time but they actually write an absolute time. Still, would allowing for relative time increments in some form resolve the need for global state and arithmetic operations?

grahamgower commented 3 years ago

I do empathise with not wanting expressions in the yaml. But I don't think the time delta facility is going to cut it. In any event, this is a very informative discussion, as it strongly suggests that we shouldn't do parameterised models in the yaml format at all.

Parameterised models could be something we facilitate within the python library though. What I mean here, is that we provide some consistent API where a concrete model can be easily drawn from a parameterised model: this would work fine for simulation with msprime or fwdpy11 (we just translate concrete models). But I don't think this will work for moments and dadi though, as they have their own way of doing optimisation with their own parameter update steps. So here, we might think up some way to translate to their notion of a parameterised model. I don't know how realistic this latter possibility is.

jeromekelleher commented 3 years ago

An important question to consider here is, what could we do with such parameterised models, if we had them? I can imagine the demes Python package allowing us to sample concrete instantiations of these parameterised models which could be passed to simulators (and presumably things like moments and dadi, too?).

It seems harder to imagine what individual simulators (i.e., something not using the demes Python package) could do with such model descriptions. Suppose that SLiM had a demes parser and it read in a model description that included parameters using distributions. What would it do then? Run lots of simulations? How would you deal with the output of these simulations?

So yes, I've come around to the same point of view as you @grahamgower. The yaml description is to allow interchange of models with fixed parameters. We might provide some facilities to describe meta-models that have parameters drawn from distributions in the demes Python package, but the key output would be sampled concrete models that could be fed to simulators, via the yaml description (if the simulators aren't written in Python). There's nothing to be gained from simulators needing to understand distributions of parameters, because fundamentally, simulators work on concrete parameter values.

The other side of the coin then, though, is that we want inference methods to output demes yaml descriptions (interchange). They will want some way of specifying the precision of their parameter estimates. Can we facilitate this, or should we just ignore it?

grahamgower commented 3 years ago

An important question to consider here is, what could we do with such parameterised models, if we had them?

Use them as input for parameter inference methods. And, if the method produces posterior distributions rather than point estimates, use them for the output as well. For this to be truly useful, it must be possible to do more than just specify the precision of a marginal parameter estimate---it must be possible to describe high-dimensional joint distributions (with correlations between variables). Actually, the more I think about this, the more it seems like a pipe dream. :(

It seems harder to imagine what individual simulators (i.e., something not using the demes Python package) could do with such model descriptions. Suppose that SLiM had a demes parser and it read in a model description that included parameters using distributions. What would it do then? Run lots of simulations? How would you deal with the output of these simulations?

That's a fair question. A simulator could sample a concrete model from the parameters. Or use expected values to instantiate a model of central tendency? It would probably be fair for a simulator to reject such a yaml file though. As you say, the demes Python package could instead be used do such sampling of concrete models to be passed to the simulator.

apragsdale commented 3 years ago

I was fairly hopeful at first that we'd be able to fully describe parameterized models in the YAML description, where times and attributes could be variable and we only need to specify bounds and relative dependencies (e.g. T_A < T_B), and then could run inference as simply as moments.Inference.from_demes("model.yaml"). But I think this discussion is pointing out just how difficult and messy that would be. I guess now I'm leaning toward keeping the python-demes API clean and more or less as-is, and the YAML description remains just for static demographic models.

Instead, I think the API is flexible enough so that someone like me who wants to use demes in an inference setting, such as moments, could easily write a wrapper that takes as parameters (demo_params, bounds, a list of rules such as relative ordering) and that function creates and returns a demes object to pass to the simulator. I think that a lot of the details of how you write parameterized models will be simulator-specific, and it's going to be difficult to write something that works with each moments- or dadi-based inference, drawing parameters for msprime or fwdpy11, and also plays nicely with slim or whatever other software gets on board.

There's still the question of how do you report a model that is inferred in an ABC framework, where you've inferred distributions for model parameters? Do we just have to say that it's not really possible, and we only support writing out a YAML with, say, the maximum likelihood parameters?

grahamgower commented 3 years ago

There's still the question of how do you report a model that is inferred in an ABC framework, where you've inferred distributions for model parameters? Do we just have to say that it's not really possible, and we only support writing out a YAML with, say, the maximum likelihood parameters?

What if an inference method (ABC, or MCMC) just spits out yaml files drawn from the posterior distribution? Most yaml files are going to be < 5kb (the Jacobs et al. Papaun model is 2.3kb). It bet it would take only a few Mb to store 100,000 of them compressed.

jeromekelleher commented 3 years ago

There's an elegant simplicity to this, I like it @grahamgower ! In a way, this is the only way to fully describe these complex interconnected distributions. If they could be simply stated, then we probably wouldn't need to do MCMC or whatever in the first place.