popsim-consortium / demes-python

Tools for describing and manipulating demographic models.
https://popsim-consortium.github.io/demes-docs/
ISC License

Add a `graph.slice()` feature to API #332

Open · apragsdale opened 3 years ago

apragsdale commented 3 years ago

In conversation with @molpopgen, we thought it could be useful to support a method for slicing an existing graph at a given time, returning a top half that describes the demography up to the slice point and a bottom half with the remainder, which ignores the demographic events above the slice. This came up because, for example, we want to be able to simulate the top half with msprime and then add selected mutations to only the bottom half (e.g. drop a sweeping mutation at a time after the slice event using fwdpy11), in effect to "precapitate" using the faster neutral simulator.

In practice this could be a bit tricky. Below is a simple example of what we have in mind: slice the original model, returning two new graphs. The more ancient slice just cuts at the slice point and resets times so that the slice becomes time zero. The bottom slice keeps the demographic events and migrations from the slice to time zero, and above the slice time we just add epochs to existing demes extending into the past.

Original graph:

description: pre-sliced demography
time_units: generations
demes:
- name: ancestral
  epochs:
  - {start_size: 100, end_time: 150}
  - {start_size: 200, end_time: 100}
- name: deme1
  ancestors: [ancestral]
  epochs:
  - {start_size: 50, end_size: 200, end_time: 0}
- name: deme2
  ancestors: [ancestral]
  epochs:
  - {start_size: 75, end_time: 20}
  - {start_size: 300, end_time: 0}
migrations:
- demes: [deme1, deme2]
  rate: 1e-2

Top slice:

description: after slicing original graph at time 50 (above slice)
time_units: generations
demes:
- name: ancestral
  epochs:
  - {start_size: 100, end_time: 100}
  - {start_size: 200, end_time: 50}
- name: deme1
  description: need to get size at slice time for epochs with changing size
  ancestors: [ancestral]
  epochs:
  - {start_size: 50, end_size: 100, end_time: 0}
- name: deme2
  ancestors: [ancestral]
  epochs:
  - {start_size: 75, end_time: 0}
migrations:
- demes: [deme1, deme2]
  rate: 1e-2

Bottom slice:

description: after slicing original graph at time 50 (below slice)
time_units: generations
demes:
- name: deme1
  epochs:
  - {start_size: 100, end_time: 50}
  - {start_size: 100, end_size: 200, end_time: 0}
- name: deme2
  epochs:
  - {start_size: 75, end_time: 50}
  - {start_size: 75, end_time: 20}
  - {start_size: 300, end_time: 0}
migrations:
- demes: [deme1, deme2]
  rate: 1e-2
  start_time: 50

Illustration using demesdraw (note deme widths are not to scale between panels): [image: sliced]
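For concreteness, here is a minimal sketch of what the bottom-slice construction could look like with the demes.Builder API. The name slice_bottom is hypothetical, and the sketch ignores pulses, selfing/cloning rates, and the corner cases raised below (e.g. events at exactly the slice time):

import demes

def slice_bottom(graph: demes.Graph, t: float) -> demes.Graph:
    """Sketch: return the part of ``graph`` younger than time ``t``."""
    b = demes.Builder(
        description=f"below slice at {t}",
        time_units=graph.time_units,
        generation_time=graph.generation_time,
    )
    for deme in graph.demes:
        if deme.end_time >= t:
            continue  # deme is gone by the slice time; drop it
        if deme.start_time > t:
            # Alive at the slice: detach from ancestors and prepend a
            # constant-size epoch ending at t, so that t is the oldest
            # finite time a simulator will see in the model.
            size = deme.size_at(t)
            epochs = [dict(start_size=size, end_time=t)]
            for e in deme.epochs:
                if e.end_time < t:
                    epochs.append(dict(
                        # an epoch spanning the slice starts at its size at t
                        start_size=e.start_size if e.start_time <= t else size,
                        end_size=e.end_size,
                        end_time=e.end_time,
                        size_function=e.size_function,
                    ))
            b.add_deme(deme.name, epochs=epochs)
        else:
            # Originates below the slice: copy it unchanged, ancestors and all.
            b.add_deme(
                deme.name,
                ancestors=list(deme.ancestors),
                proportions=list(deme.proportions),
                start_time=deme.start_time,
                epochs=[dict(start_size=e.start_size, end_size=e.end_size,
                             end_time=e.end_time, size_function=e.size_function)
                        for e in deme.epochs],
            )
    for m in graph.migrations:
        if m.end_time < t:
            b.add_migration(source=m.source, dest=m.dest, rate=m.rate,
                            start_time=min(m.start_time, t), end_time=m.end_time)
    return b.resolve()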

Some issues/questions:

  • What do people think? This would be useful for some simulation settings - is it worth it?
  • How would we handle slices that occur at the same exact time as existing discrete demographic events?
  • There would be corner cases where we need to ensure that epoch time spans do not equal zero.

grahamgower commented 3 years ago

That's an interesting idea. Can you elaborate on why (post) recapitation isn't your preferred approach? I think for recapitation you would then only need the bottom slice (for fwdpy11), and msprime could be given the full unsliced graph (it would just ignore events younger than the slice time).

In any event, the bottom half is likely the more awkward half to construct, because we require that demes without ancestors have an infinite start time. Your suggestion to extend the oldest epochs in the bottom slice towards infinity might work, but whether this works will depend on how the simulator chooses the time for the start of the simulation (the time corresponding to the end of the burn-in). For SLiM, I choose this based on the oldest non-infinite time listed in the model (which might be a pulse time, deme start_time, epoch end_time, or migration start_time or end_time). I guess you've done the same thing with fwdpy11? I feel we should try to match behaviour here, if possible.

Assuming this is the one-true-way for forward simulators to choose the simulation start time, then just extending the bottom-slice demes to infinity leaves the simulation start time as ambiguous, and the simulation start time could end up much more recent than the slice time (but maybe that doesn't matter?). Maybe it's easier to generalise the forward simulation code to accept a time parameter for the simulation start time, and everything before this is just ignored (so slicing would happen during conversion to forwards-time generations)?
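To make that rule concrete, a sketch (the helper name is hypothetical):

import math
import demes

def oldest_finite_time(graph: demes.Graph) -> float:
    # the oldest non-infinite time mentioned anywhere in the model, which
    # a forward simulator might use as the end of the burn-in
    times = set()
    for deme in graph.demes:
        times.add(deme.start_time)
        times.update(e.end_time for e in deme.epochs)
    for m in graph.migrations:
        times.update([m.start_time, m.end_time])
    times.update(p.time for p in graph.pulses)
    return max(tt for tt in times if not math.isinf(tt))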

Some issues/questions:

  • What do people think? This would be useful for some simulation settings - is it worth it?

Definitely worth it for graphs with ancient events/structure, and non-neutral features in the recent past.

  • How would we handle slices that occur at the same exact time as existing discrete demographic events?

Following our existing conventions, the resulting top and bottom slices would span the open-closed time intervals (inf, slice_time] and (slice_time, 0]. So I think it's clear that the discrete events occur in the top half.

  • There would be corner cases where we need to ensure that epoch time spans do not equal zero.

Can you provide an example?

jeromekelleher commented 3 years ago

Sounds like an interesting idea to me too. One question that occurs to me, though: why can't the simulators deal with this by having start and stop times? I think msprime is able to deal with this sort of thing?

molpopgen commented 3 years ago

One question that occurs to me, though: why can't the simulators deal with this by having start and stop times?

Because that would require an API modification just to accommodate demes, and documentation that a model built with demes needs special treatment, which doesn't seem like what we want.

molpopgen commented 3 years ago

That's an interesting idea. Can you elaborate on why (post) recapitation isn't your preferred approach? I think for recapitation you would then only need the bottom slice (for fwdpy11), and msprime could be given the full unsliced graph (it would just ignore events younger than the slice time).

There are use cases where precapitation is preferable, such as simulating sweeps: dropping a mutation at a given frequency in a given deme requires pre-existing ancestry (in fwdpy11).

In any event, the bottom half is likely the more awkward half to construct, because we require that demes without ancestors have an infinite start time. Your suggestion to extend the oldest epochs in the bottom slice towards infinity might work, but whether this works will depend on how the simulator chooses the time for the start of the simulation (the time corresponding to the end of the burn-in). For SLiM, I choose this based on the oldest non-infinite time listed in the model (which might be a pulse time, deme start_time, epoch end_time, or migration start_time or end_time). I guess you've done the same thing with fwdpy11? I feel we should try to match behaviour here, if possible.

The start time of the simulation (0) and the end of the burn-in are different things. When we read the model in, a burnin value is an option, and model times are adjusted by that value to be an offset from 0.

For the bottom time slice, one would convert it to the internal representation with burnin = 0, at least for the case @apragsdale described (start w/msprime, finish with a forward sim).
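A small sketch (with assumed names) of that conversion:

# demes times count backwards from 0; the forward simulation counts
# generations forwards from the start of the burn-in.
def to_forward_time(t, oldest_finite_time, burnin):
    return burnin + (oldest_finite_time - t)

# On the bottom slice of the example, with burnin = 0, the slice time (50)
# maps to forward generation 0 and the present (t = 0) to generation 50.
assert to_forward_time(50, 50, 0) == 0
assert to_forward_time(0, 50, 0) == 50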

Assuming this is the one-true-way for forward simulators to choose the simulation start time, then just extending the bottom-slice demes to infinity leaves the simulation start time as ambiguous, and the simulation start time could end up much more recent than the slice time (but maybe that doesn't matter?). Maybe it's easier to generalise the forward simulation code to accept a time parameter for the simulation start time, and everything before this is just ignored (so slicing would happen during conversion to forwards-time generations)?

Well, this is tricky. IMO, one wouldn't want to modify the simulator (that's basically demes breaking things). One could modify the conversion code, though, which we haven't thought about.

molpopgen commented 3 years ago

One issue with the start/stop idea is that one would have to calculate the deme sizes at the slice time. In theory, demes can now do that for you, but the function assumes continuous time, meaning discrete-time simulators will run into situations where the final sizes can be off by a bit.

For discrete time, one has to calculate the deme sizes a bit differently with exponential growth, e.g.:

double next_size = std::round(static_cast<double>(N0[deme]) * std::pow(G[deme], static_cast<double>(t - onset[deme] + 1)));

More generally, since demes allows non-integer values for deme sizes, the size_at function may not be doing the right thing for discrete models, as it is ambiguous how to treat those Ne: floor? round? ceil?
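As a toy illustration (with assumed sizes, mirroring the snippet above), near the end of an exponential-growth epoch the discrete value can differ from the continuous one by more than one individual:

import math

N0, N1, span = 50, 200, 150      # epoch start/end sizes, epoch length
G = (N1 / N0) ** (1.0 / span)    # per-generation growth factor

def size_continuous(elapsed):
    # what a continuous-time size_at computes
    return N0 * math.exp(elapsed / span * math.log(N1 / N0))

def size_discrete(elapsed):
    # mirrors the C++ above: round(N0 * G**(t - onset + 1))
    return round(N0 * G ** (elapsed + 1))

for t in (0, 75, 149):
    print(t, size_continuous(t), size_discrete(t))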

molpopgen commented 3 years ago

Because that would require an API modification just to accommodate demes, and documentation that a model built with demes need special treatment, which doesn't seem like what we want.

I'm referring here to API changes for the simulation function (analog of msprime.sim_ancestry). Changing the demes read-in/convert function is more reasonable.

grahamgower commented 3 years ago

There are use cases where precapitation is preferable, such as simulating sweeps: dropping a mutation at a given frequency in a given deme requires pre-existing ancestry (in fwdpy11).

Ok, nice!

In any event, the bottom half is likely the more awkward half to construct, because we require that demes without ancestors have an infinite start time. Your suggestion to extend the oldest epochs in the bottom slice towards infinity might work, but whether this works will depend on how the simulator chooses the time for the start of the simulation (the time corresponding to the end of the burn-in). For SLiM, I choose this based on the oldest non-infinite time listed in the model (which might be a pulse time, deme start_time, epoch end_time, or migration start_time or end_time). I guess you've done the same thing with fwdpy11? I feel we should try to match behaviour here, if possible.

The start time of the simulation (0) and the end of the burn-in are different things. When we read the model in, a burnin value is an option, and model times are adjusted by that value to be an offset from 0.

For the bottom time slice, one would convert it to the internal representation with burnin = 0, at least for the case @apragsdale described (start w/msprime, finish with a forward sim).

Yes, by "simulation start time" I meant the time you choose to correspond with the end of the burn-in. It's natural to choose the oldest non-infinite time in the model, and this may or may not be appropriate for the bottom half of your sliced model.

Well, this is tricky. IMO, one wouldn't want to modify the simulator (that's basically demes breaking things). One could modify the conversion code, though, which we haven't thought about.

I don't follow why this would be demes-specific, or even required (except perhaps to permit this slice-based precapitation). Surely being able to simulate a specific time interval of a demographic model is a feature that might be useful regardless of how the demographic model is given to the simulator?

molpopgen commented 3 years ago

Surely being able to simulate a specific time interval of a demographic model is a feature that might be useful regardless of how the demographic model is given to the simulator?

Perhaps, but it has never come up before, so it is not implemented. When building models using the low-level API, one would just build the model with the events at the "right times" or use Python operations to extract events from a larger model's dict representation. So it becomes a demes-specific need by default.

apragsdale commented 3 years ago

Your suggestion to extend the oldest epochs in the bottom slice towards infinity might work, but whether this works will depend on how the simulator chooses the time for the start of the simulation (the time corresponding to the end of the burn-in). For SLiM, I choose this based on the oldest non-infinite time listed in the model (which might be a pulse time, deme start_time, epoch end_time, or migration start_time or end_time). I guess you've done the same thing with fwdpy11? I feel we should try to match behaviour here, if possible.

Yes, this is the same approach I took in fwdpy11. In the example I originally posted, both demes in the bottom graph after slicing have their first epochs end at the time of the slice for exactly this reason, even though the size of deme2 doesn't change at that epoch break point:

- name: deme2
  epochs:
  - {start_size: 75, end_time: 50}
  - {start_size: 75, end_time: 20}
  - {start_size: 300, end_time: 0}

That way in fwdpy11 you would set the burn-in time to 0 and the simulation would start exactly at 50 generations ago, even if the sizes in the demes don't change at that time.

  • How would we handle slices that occur at the same exact time as existing discrete demographic events?

Following our existing conventions, the resulting top and bottom slices would span the open-closed time intervals (inf, slice_time] and (slice_time, 0]. So I think it's clear that the discrete events occur in the top half.

  • There would be corner cases where we need to ensure that epoch time spans do not equal zero.

Can you provide an example?

That was my thought at first, and we have a clear rule about the semi-open intervals that I think makes sense and we don't want to change. But imagine in the IM model above we slice at time = 100. In the top slice we would have the ancestral deme and then, at the very end, split it into deme1 and deme2, since the split should occur in the [100, inf) part of the graph. But the child demes would then have start times of 0 and zero-length existence intervals. I think at some point it was decided that we don't want to allow that (and demes throws an error if a deme's start time equals zero).

We'd either need to allow demes to have zero length epochs (which I think actually makes sense in some other scenarios as well), or disallow slices that cause this effect, or ...?
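For concreteness, a sketch of the failing construction (a hypothetical top slice at t = 100, with times shifted so the slice is at zero):

import demes

# After shifting by the slice time, the split happens at time 0, so the
# child demes would need start_time = 0 and empty existence intervals.
b = demes.Builder(time_units="generations")
b.add_deme("ancestral", epochs=[dict(start_size=100, end_time=50),
                                dict(start_size=200, end_time=0)])
b.add_deme("deme1", ancestors=["ancestral"], start_time=0,
           epochs=[dict(start_size=50, end_time=0)])
b.resolve()  # raises: a deme's start_time must be greater than its end_time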

Well, this is tricky. IMO, one wouldn't want to modify the simulator (that's basically demes breaking things). One could modify the conversion code, though, which we haven't thought about.

I hadn't thought of that, but the conversion code in fwdpy11 could additionally take an optional parameter that specifies the point in the graph to "start" the simulation, and then burn-in just runs with the state of the model at that point, if burn-in > 0. Though I could imagine running into some nasty corner cases there, and if a slice feature is useful in other settings as well, that would probably still be the cleanest approach.

apragsdale commented 3 years ago

We'd either need to allow demes to have zero length epochs (which I think actually makes sense in some other scenarios as well), or disallow slices that cause this effect, or ...?

What I have in mind here is that you might want to simulate one or more populations that are then, at time zero, divided into multiple sub-demes. I could see this scenario coming up in an experimental evolution study, or in a simulation-based power analysis testing the difference between subdivision with evolution vs. sub-demes coming immediately from the same panmictic population. I guess I can imagine setting up that scenario with actual organisms, so we might want to also allow simulating it in demes.

grahamgower commented 3 years ago

Yes, this is the same approach I took in fwdpy11. In the example I originally posted, both demes in the bottom graph after slicing have their first epochs end at the time of the slice for exactly this reason, even though the size of deme2 doesn't change at that epoch break point:

- name: deme2
  epochs:
  - {start_size: 75, end_time: 50}
  - {start_size: 75, end_time: 20}
  - {start_size: 300, end_time: 0}

That way in fwdpy11 you would set the burn-in time to 0 and the simulation would start exactly at 50 generations ago, even if the sizes in the demes don't change at that time.

Oooh, that's very clever! Sorry, I missed that in your original post.

That was my thought at first, and we have a clear rule about the semi-open intervals that I think makes sense and we don't want to change. But imagine in the IM model above we slice at time = 100. In the top slice we would have the ancestral deme and then, at the very end, split it into deme1 and deme2, since the split should occur in the [100, inf) part of the graph. But the child demes would then have start times of 0 and zero-length existence intervals. I think at some point it was decided that we don't want to allow that (and demes throws an error if a deme's start time equals zero).

We'd either need to allow demes to have zero length epochs (which I think actually makes sense in some other scenarios as well), or disallow slices that cause this effect, or ...?

I don't think this is a problem for your specific example of wanting to precapitate, because you can pass the unsliced graph to msprime and ask it to start at the slice time. Events at that time should be handled immediately (but I didn't check/test).

import demes
import msprime

def demes_alive_at(graph, time):
    # demes whose existence interval contains the given time
    return [deme for deme in graph.demes if deme.start_time > time >= deme.end_time]

T_slice = 100
graph = demes.load("preslice.yaml")
demog = msprime.Demography.from_demes(graph)
# one sample set per deme alive at the slice, sized by the deme's
# (rounded) size at that time
samples = {
    deme.name: round(deme.size_at(T_slice)) for deme in demes_alive_at(graph, T_slice)
}
# simulate only the part of the model older than the slice time;
# events younger than T_slice are ignored
ts = msprime.sim_ancestry(demography=demog, samples=samples, start_time=T_slice)

One issue with the start/stop idea is that one would have to calculate the deme sizes at the slice time. In theory, demes can now do that for you, but the function assumes continuous time, meaning discrete-time simulators will run into situations where the final sizes can be off by a bit.

For discrete time, one has to calculate the deme sizes a bit differently with exponential growth, e.g.:

double next_size = std::round(static_cast<double>(N0[deme]) * std::pow(G[deme], static_cast<double>(t - onset[deme] + 1)));

More generally, since demes allows non-integer values for deme sizes, the size_at function may not be doing the right thing for discrete models, as it is ambiguous how to treat those Ne: floor? round? ceil?

I don't have any suggestions here, other than to round to an integer however you see fit. Is the maximum error here greater than 1 individual?

molpopgen commented 3 years ago

I don't have any suggestions here, other than to round to an integer however you see fit. Is the maximum error here greater than 1 individual?

Yeah, it can be. The compounded interest makes a difference.

It shouldn't be, if you treat both ends consistently. The problem is that someone may use Python's round(), which (amazingly) rounds incorrectly (and this is documented).

grahamgower commented 3 years ago

Haha, yeah round() is curious in that it rounds halves towards even numbers. But it's not alone in that behaviour, as many languages have chosen to do the same thing (e.g. R). It's just annoying that there's no round-halves-up function in the standard lib.

molpopgen commented 3 years ago

Haha, yeah round() is curious in that it rounds halves towards even numbers.

But not always, I think. I happened upon this lovely situation after writing loads of quiz questions for my class in Python and I could not predict the answer.

It's just annoying that there's no round-halves-up function in the standard lib.

One usually wants round-to-even here, which np.round does (in addition to addressing IEEE floating-point concerns). Round-halves-up leads to biases when summing over rounded values, and the biases can be big. This caused another problem in our course: people with lots of physical sciences background found round-to-even natural, but that's a minority of bio majors. The majority were unhappy to have to unlearn the methods they absorbed in grade school.
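A quick numeric illustration of that bias (not from the thread): summing exact halves, round-half-up systematically overshoots, while round-half-to-even does not.

import numpy as np

vals = np.arange(0.5, 1000.5, 1.0)  # 0.5, 1.5, ..., 999.5; true sum = 500000
print(np.floor(vals + 0.5).sum())   # round-half-up:      500500.0
print(np.round(vals).sum())         # round-half-to-even: 500000.0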

molpopgen commented 3 years ago

@grahamgower: this is the behavior that is irksome. (I had to look at the issue tickets for my course materials.)

In [1]: import numpy as np

In [2]: x = 2.675

In [3]: round(x, 2)
Out[3]: 2.67

In [4]: np.round(x, 2)
Out[4]: 2.68


From the docs:

Note

The behavior of round() for floats can be surprising: for example, round(2.675, 2) gives 2.67 instead of the expected 2.68. This is not a bug: it’s a result of the fact that most decimal fractions can’t be represented exactly as a float. See Floating Point Arithmetic: Issues and Limitations for more information. 
grahamgower commented 3 years ago

That is surprising, given that 2.675 seems to be represented exactly as a float.

molpopgen commented 3 years ago

That is surprising, given that 2.675 seems to be represented exactly as a float

I think Python may not be using C99? The correct round-to-even is trivial to implement using nearbyint instead of round. R has the same error with this exact operation, so I'm guessing it is a common back-end issue.
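For reference (a quick check, not from the thread), the double nearest to 2.675 is in fact slightly below it, so this is a representation issue rather than a tie-breaking one:

from decimal import Decimal

print(Decimal(2.675))
# 2.67499999999999982236431605997495353221893310546875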

grahamgower commented 3 years ago

We'd either need to allow demes to have zero length epochs (which I think actually makes sense in some other scenarios as well), or disallow slices that cause this effect, or ...?

What I have in mind here is that you might want to simulate one or more populations that are then, at time zero, divided into multiple sub-demes. I could see this scenario coming up in an experimental evolution study, or in a simulation-based power analysis testing the difference between subdivision with evolution vs. sub-demes coming immediately from the same panmictic population. I guess I can imagine setting up that scenario with actual organisms, so we might want to also allow simulating it in demes.

Did you have any further thoughts here @apragsdale?

I think we really do want to keep the current restriction that epoch lengths must be strictly greater than 0. It's fairly common to want to divide some number by an epoch's or deme's time span, so that code would need special cases to deal with zero-length epochs (code in the wild would inevitably be buggy, because such zero-length epochs wouldn't be tested). The example you provide here could be dealt with by just sampling individuals at time zero after the simulation has completed. Or am I missing something?

A lower_slice() function could return just the bottom half of the graph, which should always be valid if the original graph is valid. The upper slice would be problematic, as you rightly point out. But I'm not convinced the top slice is needed for p/recapitation, at least with msprime. In any event, maybe there's some commonality here with the prune() operation in #167?