kingaa / pomp

R package for statistical inference using partially observed Markov processes
https://kingaa.github.io/pomp
GNU General Public License v3.0

Big discrepancy between euler and gillespie/ode sample paths #128

Closed: fintzij closed this issue 4 years ago

fintzij commented 4 years ago

I am trying to code up an SEIR model in pomp and am seeing a rather large discrepancy in the distribution of sample paths that I generate when simulating with the Euler-multinomial stepper compared with both Gillespie simulation and the deterministic ODE skeleton. I've attached some code below, which is largely based on examples in the pomp documentation. I'm fairly certain that the Gillespie/ODE algorithms are producing the correct paths, as I have also benchmarked the results against another package, but I can't for the life of me figure out where I'm going wrong in the specification of the Euler-multinomial stepper. I'd really appreciate any advice you could offer, as I've run out of rocks to look under. It's really a simple model, so if it's an error on my end I'm sure it's a rather silly one, but I just can't seem to find it.

Thank you!!!

library(pomp)
library(ggplot2)

# pomp object ---------------------------------------------------------------------------------

# Measurement process objects
rmeas <- "
  cases=rnbinom_mu(phi, E2I * rho);    // simulates the data
"

dmeas <- "
  lik=dnbinom_mu(cases, phi, E2I * rho, give_log); // emission density
"

# define the steppers
seir_step<-
      euler(step.fun = 
      Csnippet("double rate[3];
                double dN[3];
                rate[0]=theta * I;                           // force of infection
                rate[1]=omega;                               // E-to-I progression rate
                rate[2]=mu;                                  // recovery rate
                reulermultinom(1,S,&rate[0],dt,&dN[0]);      // new infections
                reulermultinom(1,E,&rate[1],dt,&dN[1]);      // new progressions
                reulermultinom(1,I,&rate[2],dt,&dN[2]);      // new recoveries
                S+=-dN[0];                                   // update the number of Susceptibles
                E+=dN[0]-dN[1];                              // update the number of Exposed
                I+=dN[1]-dN[2];                              // update the number of Infectious
                R+=dN[2];                                    // update the number of Recovered
                E2I+=dN[1];                                  // accumulate new E-to-I transitions
                "), delta.t = 1/7)

seir_gillespie = 
      gillespie_hl(
            exposure = list("rate = theta * I * S;", 
                            c(S=-1,E=1,I=0,R=0,E2I=0)),
            infection = list("rate = omega * E;", 
                             c(S=0,E=-1,I=1 ,R=0,E2I=1)),
            recovery = list("rate = mu * I;", 
                            c(S=0,E=0,I=-1,R=1,E2I=0))
      )

seir_odes <- Csnippet("
  DS = -theta * S * I;
  DE = theta * S * I - omega * E;
  DI = omega * E - mu * I;
  DR = mu * I;
  DE2I= omega * E;
")

# parameters and population size
popsize = 5e4
pars = c(theta      = 1.3 / popsize / (2.5/7), # transmission rate, chosen so that R0 = 1.3
         omega      = 1 / (2/7), # latent period duration = 2 days
         mu         = 1 / (2.5/7), # infectious period duration = 2.5 days
         rho        = 0.02, # case detection rate
         phi        = 36)  # negative binomial overdispersion
parnames = names(pars)

# Create pomp objects ---------------------------------------------------------------

seir_euler <- pomp(
      data=data.frame(time=1:30,cases=NA),
      times = "time",
      t0 = 0,                      # initial time point
      dmeasure = Csnippet(dmeas),  # evaluates the density of the measurement process
      rmeasure = Csnippet(rmeas),  # simulates from the measurement process
      rprocess = seir_step,          # simulates from the latent process
      statenames = c("S", "E", "I", "R", "E2I"),  # state variable names
      params = pars,
      paramnames = parnames, # parameter names
      accumvars = c("E2I"),
      rinit = function(params, t0, ...) {
            return(c(S = popsize-50, E = 25, I = 25, R = 0, E2I = 0))
      }
)

seir_gillespie <- pomp(
      data=data.frame(time=1:30,cases=NA),
      times = "time",
      t0 = 0,                      # initial time point
      dmeasure = Csnippet(dmeas),  # evaluates the density of the measurement process
      rmeasure = Csnippet(rmeas),  # simulates from the measurement process
      rprocess = seir_gillespie,          # simulates from the latent process
      statenames = c("S", "E", "I", "R", "E2I"),  # state variable names
      params = pars,
      paramnames = parnames, # parameter names
      accumvars = c("E2I"),
      rinit = function(params, t0, ...) {
            return(c(S = popsize-50, E = 25, I = 25, R = 0, E2I = 0))
      }
)

seir_ode <- pomp(
      data=data.frame(time=1:30,cases=NA),
      times = "time",
      t0 = 0,                      # initial time point
      dmeasure = Csnippet(dmeas),  # evaluates the density of the measurement process
      rmeasure = Csnippet(rmeas),  # simulates from the measurement process
      skeleton = vectorfield(seir_odes), # deterministic skeleton of the latent process
      statenames = c("S", "E", "I", "R", "E2I"),  # state variable names
      params = pars,
      paramnames = parnames, # parameter names
      accumvars = c("E2I"),
      rinit = function(params, t0, ...) {
            return(c(S = popsize-50, E = 25, I = 25, R = 0, E2I = 0))
      }
)

# Generate simulations --------------------------------------------------------------

euler_sims = simulate(seir_euler, nsim = 1e3)
gillespie_sims = simulate(seir_gillespie, nsim = 1e3)
ode_path = trajectory(seir_ode, format = "data.frame")

# summarize and plot ----------------------------------------------------------------

euler_quants = 
      data.frame(time = 1:30,
                 t(apply(do.call(cbind, lapply(euler_sims, function(x) x@states["E2I",])),
                         1, quantile, c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95))))

gillespie_quants = 
      data.frame(time = 1:30,
                 t(apply(do.call(cbind, lapply(gillespie_sims, function(x) x@states["E2I",])),
                         1, quantile, c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95))))

quants = data.frame(method = rep(c("Euler", "Gillespie"), each = 30),
                    rbind(euler_quants, gillespie_quants))

# plot
ggplot(quants, aes(x = time, y = X50., colour = method)) + 
      geom_line() + 
      geom_ribbon(data = quants,
                  aes(ymin = X5., ymax = X95., fill = method, colour = method),
                  alpha = 0.3) + 
      geom_ribbon(data = quants,
                  aes(ymin = X10., ymax = X90., fill = method, colour = method),
                  alpha = 0.3) + 
      geom_ribbon(data = quants,
                  aes(ymin = X25., ymax = X75., fill = method, colour = method),
                  alpha = 0.3) +
      geom_line(data = ode_path, aes(x= time, y = E2I), colour = "black") +
      scale_fill_brewer(type = "qual", palette = 1) + 
      scale_color_brewer(type = "qual", palette = 1) + 
      theme_minimal() 

[Figure issue128-1: quantile ribbons of E2I for the Euler (delta.t = 1/7) and Gillespie simulations, with the deterministic ODE path in black, showing the discrepancy.]

kingaa commented 4 years ago

You raise several interesting issues. Among them are:

  1. What is the relationship between the Euler and the Gillespie simulators?
  2. What is the relationship between stochastic simulations and a deterministic skeleton?
  3. What is the role of exact simulations in data analysis?

Perhaps there are others that you would like to discuss as well.

On the first issue: the Gillespie algorithm is an exact simulator for a continuous-time, homogeneous generalized birth-death process. In such a system, stochasticity is purely demographic and per-capita event rates remain constant between events. The Euler method, by contrast, is an approximation scheme, and a first-order one at that. In particular, the Euler process converges to the Gillespie one as the Euler time-step tends to zero. Below, I have modified your code to use a much smaller time-step. You will see that the distributions then agree to a much higher degree of precision.
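
As a minimal sketch of this convergence (my illustration, not part of the original exchange; it uses pomp's R-level reulermultinom and the parameter values from the code above, dropping the E compartment for brevity), the mean one-week incidence of a bare S-to-I process stabilizes as the step shrinks:

library(pomp)
set.seed(2048)

## one week of a simplified infection process (S -> I only); the hazard
## theta*I is held fixed within each Euler step of length dt (in weeks)
one_week_mean <- function(dt, nreps = 1000,
                          theta = 1.3/5e4/(2.5/7), S0 = 49950, I0 = 25) {
  mean(replicate(nreps, {
    S <- S0; I <- I0; inc <- 0
    for (k in seq_len(round(1/dt))) {
      dN <- as.numeric(reulermultinom(n = 1, size = S, rate = theta*I, dt = dt))
      S <- S - dN; I <- I + dN; inc <- inc + dN
    }
    inc
  }))
}

## successively finer steps: the discretization bias shrinks as dt -> 0
sapply(c(1, 1/7, 1/70, 1/700), one_week_mean)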

On the second issue, there are several reasons to expect that there may be large differences between sample paths of a stochastic process and trajectories of its deterministic skeleton. For one, the deterministic skeleton is not uniquely defined. Second, if the noise is not small, stochastic trajectories may exhibit large deviations from the deterministic trajectories. Third, even if the noise is small, dynamics that are purely transient in the deterministic system will be manifest in the stochastic system due to re-excitation of transients. Given all this, it is reasonable to ask why anyone takes interest in the skeleton at all, and indeed some of the interest is purely historical. However, it is sometimes the case that a deterministic analysis can shed light on the behavior of a stochastic system.

On the third issue, some philosophical reflections are in order. One might expect that it is very important that one have a high-fidelity stochastic simulator—an exact one if possible—in any data analysis. Certainly (it would seem) development and deployment of exact simulation methods would be a useful thing for a mathematically-inclined person to devote time to. But consider the following: How much effort is it worth expending to obtain exact solutions for a model that is itself a poor approximation to the reality one is attempting to describe? How much effort should go into improving the fidelity of the simulations to the model, and how much into making the model/reality connection more faithful?

For example, inasmuch as the SIR model we have been discussing is meant to explain a series of observations from a real system, it is clear that many, potentially important, effects are missing: Are contact rates really constant, or do they perhaps vary over the course of a day? If so, wouldn't it be better to include diurnal variation in the contact rates? Doing so would take Gillespie off the table, unfortunately, although one could develop a custom exact stochastic simulation algorithm that would only necessitate integrating various hazards over short time intervals. One could go on. The point is not that exact simulation methods are without value in a data analysis, but that they are costly, and that one must consider what value one obtains for the price. There can be no single, general answer to the questions I posed in the preceding paragraph, but the issue is present in every data analysis.

To put it another way, in simulation-based analysis, all commerce between model and data passes through the simulator. Thus, in a very real sense, the simulator is the model that attempts to explain the data. From this point of view, the model equations are merely approximations of the simulator, more or less worthy of study according to how well they represent the simulations.

library(pomp)
library(tidyverse)

## define the steppers
seir_step <- Csnippet("
  double rate[3];
  double dN[3];
  rate[0]=theta * I;                           // force of infection
  rate[1]=omega;                               // E-to-I progression rate
  rate[2]=mu;                                  // recovery rate
  reulermultinom(1,S,&rate[0],dt,&dN[0]);      // new infections
  reulermultinom(1,E,&rate[1],dt,&dN[1]);      // new progressions
  reulermultinom(1,I,&rate[2],dt,&dN[2]);      // new recoveries
  S+=-dN[0];                                   // update the number of Susceptibles
  E+=dN[0]-dN[1];                              // update the number of Exposed
  I+=dN[1]-dN[2];                              // update the number of Infectious
  R+=dN[2];                                    // update the number of Recovered
  E2I+=dN[1];                                  // accumulate new E-to-I transitions"
)

seir_gillespie <-
  gillespie_hl(
    exposure = list("rate = theta * I * S;",
                    c(S=-1,E=1,I=0,R=0,E2I=0)),
    infection = list("rate = omega * E;",
                     c(S=0,E=-1,I=1 ,R=0,E2I=1)),
    recovery = list("rate = mu * I;",
                    c(S=0,E=0,I=-1,R=1,E2I=0))
  )

seir_odes <- Csnippet("
  DS = -theta * S * I;
  DE = theta * S * I - omega * E;
  DI = omega * E - mu * I;
  DR = mu * I;
  DE2I= omega * E;
")

## parameters
popsize <- 5e4
pars <- c(
  theta      = 1.3 / popsize / (2.5/7), # transmission rate, chosen so that R0 = 1.3
  omega      = 1 / (2/7), # latent period duration = 2 days
  mu         = 1 / (2.5/7), # infectious period duration = 2.5 days
  phi        = 36, # negative binomial overdispersion
  popsize    = popsize
)
parnames <- names(pars)
statenames <- c("S", "E", "I", "R", "E2I")  # state variable names

seir <- pomp(
  data=data.frame(time=1:30,cases=NA),
  times = "time",
  t0 = 0,                            # initial time point
  skeleton = vectorfield(seir_odes), # deterministic skeleton of the latent process
  params = pars,
  paramnames = parnames,
  statenames = statenames,
  accumvars = c("E2I"),
  rinit = function(popsize, t0, ...) {
    return(c(S = popsize-50, E = 25, I = 25, R = 0, E2I = 0))
  }
)

## Generate simulations: NOTE WE HAVE TWO SETS OF EULER SIMULATIONS

seir %>%
  simulate(
    rprocess=euler(seir_step,delta.t=1/7),
    statenames=statenames,
    paramnames=parnames,
    nsim = 1e3,
    format="d"
  ) -> euler_sims1

seir %>%
  simulate(
    rprocess=euler(seir_step,delta.t=1/700),
    statenames=statenames,
    paramnames=parnames,
    nsim=1e3,
    format="d"
  ) -> euler_sims2

seir %>%
  simulate(
    rprocess=seir_gillespie,
    statenames=statenames,
    paramnames=parnames,
    nsim=1e3,
    format="d"
  ) -> gillespie_sims

seir %>%
  trajectory(format="d") -> ode_path

## summarize and plot ----------------------------------------------------------------

bind_rows(
  Euler1 = euler_sims1 %>% select(time,rep=.id,E2I),
  Euler2 = euler_sims2 %>% select(time,rep=.id,E2I),
  Gillespie = gillespie_sims %>% select(time,rep=.id,E2I),
  .id = "simulator"
) %>%
  group_by(simulator,time) %>%
  summarize(
    p=c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95),
    q=quantile(E2I,prob=p),
    label=sprintf("Q%d%%",100*p)
  ) %>%
  select(-p) %>%
  spread(label,q) %>%
  ungroup() -> quants

## plot

quants %>%
  ggplot(aes(x = time, y = `Q50%`)) +
  geom_ribbon(aes(ymin = `Q5%`, ymax = `Q95%`, fill=simulator), color=NA, alpha = 0.3) +
  geom_ribbon(aes(ymin = `Q10%`, ymax = `Q90%`, fill=simulator), color=NA, alpha = 0.3) +
  geom_ribbon(aes(ymin = `Q25%`, ymax = `Q75%`, fill=simulator), color=NA, alpha = 0.3) +
  geom_line(aes(color=simulator)) +
  geom_line(data = ode_path, aes(x= time, y = E2I), colour = "black") +
  guides(color=FALSE)+
  labs(y="incidence")+
  scale_fill_brewer(
    type = "qual", palette = 1,
    labels=c(Euler1="Euler, dt=1/7",Euler2="Euler, dt=1/700",Gillespie="Gillespie")) +
  scale_color_brewer(type = "qual", palette = 1) +
  theme_minimal()

[Figure issue128-2: quantile ribbons of E2I for Euler with delta.t = 1/7, Euler with delta.t = 1/700, and Gillespie, with the deterministic ODE path in black; the finer Euler step agrees closely with Gillespie.]

fintzij commented 4 years ago

Hi Aaron,

Thanks so much for the quick and thorough explanation!!!! Very much appreciated, and especially on a weekend :-D

I'm surprised the Euler time step made such a huge difference! I had initially coded up this model for use in inference, and was getting very poor coverage rates in simulations for key parameters like R0. Hence, the original motivation for comparing the Euler and Gillespie simulations was actually to debug the model. It's very interesting that the Euler time step has such an outsize influence on the average trajectory of the outbreak. I would have guessed that a coarse time step would affect the variance around the mean, but it makes sense since, as you point out, the Euler stepper is only a first-order approximation.

Totally agree with you on the question of whether the juice is worth the squeeze on exact methods. Maybe it would be a different story if we had subject-level data, but there are much more important aspects of the model that we should worry about when developing population-level models to fit to aggregate count data.

Hope you're doing well, thanks again for the help!!

Cheers, Jon

kingaa commented 4 years ago

Musings:

  1. "Outsize", relative to what?
  2. "Variance around the mean", on what scale? Cf. Jensen's inequality.
fintzij commented 4 years ago

Ah fair.

Re 1: I was surprised that too coarse a time step essentially results in trajectories that we would expect under a much higher basic reproduction number (comparing the green and orange/purple bands in the plot you posted). This doesn't just appear in forward simulation. In coverage simulations, where exact outbreak paths are simulated from a Markov jump process using the Gillespie algorithm, I was getting 80% CIs that contained the true R0 value only about 35% of the time. But as you pointed out, it's important to think about the relationship between the Euler and Gillespie simulators and to understand that the choice of time step is part of the approximation to the exact process. So I mean "outsize" not in the relative sense but in the absolute sense, since the time step can have an exceptionally large effect.

Re 2: Oops, I didn't mean the mean but the infinite-population ODE limit. We're working on a manuscript in which we use the linear noise approximation to fit stochastic epidemic models to partially observed incidence data (preprint here), so that's where my head is at. In the non-restarting formulation of the LNA (and also in the restarting formulation, but only over a single time interval), the deterministic ODE path is the mean of the distribution of latent epidemic paths. Excellent point about scales, though. We thought about this and prefer to work on the log scale. Back to the Euler approximation: my intuition was totally wrong here, partly because I was in LNA land, but mostly because I hadn't really thought about the consequences of using a coarse time interval. I had assumed that, in large populations, the Euler-multinomial trajectories would generally track the infinite-population ODE limit and generally resemble the distribution of paths obtained via exact simulation, but possibly with more or less variance around the limiting ODE path on some scale (say the log scale). But, as you pointed out, the Euler method is a first-order approximation, so it's only a good approximation locally, which obligates us to think harder than I did about what "locally" means. This is especially true if we're approximating a Markov jump process over intervals where there's a fair amount of non-linearity in the dynamics!

kingaa commented 4 years ago

Fair enough. If you take the continuous-time Markov process to be the truth, then the discretization introduces error, sure enough. I just think it's worth questioning what seems to be the widely held assumption that the continuous-time process, or the ODE limit, is a good choice of foundation on which to build expectations about real epidemics. In particular, it seems to be fairly generally the case that, as the population size under consideration gets larger, heterogeneities in the population become more noticeable, which frustrates the approach to the large-population deterministic limit. For example, it's not clear that extrapolation of R0 from the deterministic (or nearly deterministic) model is necessarily more meaningful than it is from another, less canonical model.

Regarding the relationship between discretization interval and discrepancies between deterministic and stochastic paths: isn't the issue at bottom whether there are timescales in the dynamics that are fast relative to the discretization interval? In the SIR, the epidemic process is fastest at the outset, and the discretization errors lead, through Jensen's inequality, to bias in the growth rate, hence the result that the epidemic grows faster than it "should". If this is right, then the discretization errors should be much smaller once the epidemic has taken hold. I wonder if the bias would be as evident if the epidemic were of the seasonally recurring sort instead of the virgin case...?
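
One quick diagnostic in this spirit (a rule-of-thumb sketch of my own, not from the thread): compare each per-capita rate with the step length, since a per-step exit probability approaching 1 signals that the frozen-rate approximation is strained on that timescale. With the rates used above, delta.t = 1/7 already implies a roughly 40% per-step exit probability from the E compartment.

rates <- c(progression = 1/(2/7), recovery = 1/(2.5/7))  # per week, as above
for (dt in c(1/7, 1/70, 1/700)) {
  print(round(c(dt = dt, max_exit_prob = 1 - exp(-max(rates)*dt)), 4))
}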

Thanks for calling my attention to the preprint. I've been following the work, and I like it a lot.

fintzij commented 4 years ago

Oh, yeah, agreed. A time-homogeneous process is unambiguously a terrible generative model for real-world applications. I'd add that the homogeneity isn't just a problem because the unobserved population transmission process becomes more heterogeneous as the population size increases. From a statistical perspective, I think we should be really worried that heterogeneities in our surveillance processes are being swept under the rug when we work with aggregated data. For example, in the COVID-19 setting (unrelated to this project), we really struggled with spatial and temporal variations in reporting standards, preferential testing, case definitions, etc. This was the case even within counties, and especially early on. We're not really addressing heterogeneity in dynamics and surveillance in the preprint or its revision, which is more focused on computation and (re)introducing the LNA, but it's hugely important and we're actively working on it.

I think you're right about the fundamental problem that's leading to the bias. Put another way, the discretization interval needs to be small enough that the locally linear approximation holds. I suppose this obligates us to pick an interval small enough to satisfy this throughout the modeling period. I'm actually not sure which way it would cut if you had seasonal outbreaks over many years. I could see the approximation errors compounding, but also that you might be able to get away with a coarser interval when fitting the model to data, since you have more replications. But maybe it also depends on how you model the seasonality and whether/how you borrow information across years. There are probably cases where you'd want the interval to vary, both for computational reasons and because the data aren't really informative over certain periods, e.g., big intervals over summers when modeling flu data. Would be interesting to explore!

kingaa commented 4 years ago

Excellent points, well made. I've appreciated mulling these things over with you. Who knows, maybe this will even be useful to someone else reading this exchange somewhere down the line!