Incorporating known states in the model

kingaa / pomp

R package for statistical inference using partially observed Markov processes

https://kingaa.github.io/pomp

GNU General Public License v3.0

111 stars 27 forks source link

Incorporating known states in the model #65

Closed MarieAugerMethe closed 6 years ago

MarieAugerMethe commented 6 years ago

Hello,

I'm sorry this could be again a newby question.

Let say I have perfect information of the state values for some time steps. For example, the state value (state_x) that I am predicting with the particle filter is the true location of an animal tagged with some tracking technology that gives imprecise observations (obs_x), but I know exactly where I captured the animal when I put the tag on it and where I recaptured the animal later in the year. So for some ts, I know state_x and have my imprecise obs_x. Is there a way to specify in the particle filter function that I know certain state values? I'm actually not sure whether a simple particle filter could handle this constraint. So is there another way around this issue?

Here is a quick example of a simple pomp object and simulation creating the data I would like to use. However, I'm not sure how one could specify that some of the state values are known in the particle filter.

ss_step <- Csnippet('
                    double step_x = 0.0;
                    step_x = rnorm(0, sigma_state);
                    state_x = state_x + step_x;
                    ')

r_obs <- Csnippet('
  obs_x = rnorm(state_x, sigma_obs);
')

d_obs <- Csnippet('
  if (give_log) {
     lik = dnorm(obs_x, state_x, sigma_obs, give_log);
  } else {
    lik = dnorm(obs_x, state_x, sigma_obs, give_log);
  }
')

log_trans <- Csnippet("
  Tsigma_obs = log(sigma_obs);
                      Tsigma_state = log(sigma_state);
                      ")

exp_trans <- Csnippet("
                      Tsigma_obs = exp(sigma_obs);
                      Tsigma_state = exp(sigma_state);
                      ")

n_state <- 1000                                   # Number of observations
s_obs <- 0.1                                      # SD of the observation process
s_state <- 0.05                                   # SD of the hidden process
state_x_0 <- 0                       # Initial state

dat <- data.frame(time = 1, obs_x = 0) # Dummy data to provide observation name

# Create pomp object
pomp_obj <- pomp(data = dat, time="time", t0=1, 
                 rprocess = discrete.time.sim(ss_step, delta.t=1), 
                 rmeasure = r_obs,
                 dmeasure = d_obs,
                 statenames=c("state_x"),
                 paramnames=c("sigma_state", "sigma_obs"), 
                 toEstimationScale = log_trans,
                 fromEstimationScale = exp_trans)

# Simulate 1 chain
SIM <- simulate(pomp_obj, nsim=1, times=1:n_state,
                params=c(sigma_state=s_state, sigma_obs=s_obs, state_x.0=state_x_0))

# Say that in our ecological data set we know the value of the state_x at time 1 and end of time series and maybe somewhere in between
known_state_t <- c(1,500,1000)
known_state_x <- SIM@states["state_x", known_state_t <- c(1,500,1000)]

A simple particle filter that doesn't use the known state information would be:

pf <- pfilter(SIM, Np=1000,
              params=c(sigma_state=s_state, sigma_obs=s_obs, state_x.0=0), 
              filter.mean=TRUE, filter.traj=TRUE)

How would I use the information in known_state_x in a particle filter?

Many thanks!

Marie

kingaa commented 6 years ago

@MarieAugerMethe,

It's an interesting question you ask. There are two levels to it at least:

How do we get such a thing into pomp?
What can we do with it once we do?

For the first question, what about the idea of coding up the two kinds of observations as distinct data variables? I.e., you would have your tracking data and your capture data as separate variables. The measurement error on the first would be quite a bit larger than that on the second, which you could even assume to be zero, I suppose.

I'm imagining something like the following.

library(pomp)

ss_step <- Csnippet('
  double step_x = 0.0;
  step_x = rnorm(0, sigma_state);
  state_x = state_x + step_x;
  ')

r_obs <- Csnippet('
  track_x = rnorm(state_x, sigma_track);
  capture_x = rnorm(state_x, sigma_capture);
  ')

d_obs <- Csnippet('
  if (give_log) {
    lik = dnorm(track_x, state_x, sigma_track, 1) +
      ((R_FINITE(capture_x)) ? dnorm(capture_x, state_x, sigma_capture, 1) : 0);
  } else {
    lik = dnorm(track_x, state_x, sigma_track, 0) *
      ((R_FINITE(capture_x)) ? dnorm(capture_x, state_x, sigma_capture, 0) : 1);
  }
  ')

n_state <- 1000      # Number of observations
s_obs <- 0.1         # SD of the observation process
s_state <- 0.05      # SD of the hidden process
state_x_0 <- 0       # Initial state

## Create pomp object
SIM <- simulate(
  pomp(
    data = data.frame(time = 1, track_x = 0, capture_x = 0),
    time="time", t0=1,
    rprocess = discrete.time.sim(ss_step, delta.t=1),
    rmeasure = r_obs,
    dmeasure = d_obs,
    statenames=c("state_x"),
    paramnames=c("sigma_state", "sigma_track","sigma_capture")
  ),
  times=1:n_state,
  params=c(
    sigma_state=s_state,
    sigma_track=s_obs,
    sigma_capture=0,
    state_x.0=state_x_0
    ),
  as.data.frame=TRUE)

## Say that in our ecological data set we know the value of the state_x at time 1
## and end of time series and maybe somewhere in between:
SIM$capture_x <- ifelse(SIM$time %in% c(1,500,1000), SIM$capture_x, NA)

library(ggplot2)
ggplot(data=SIM,aes(x=time))+
  geom_line(aes(y=track_x))+
  geom_point(aes(y=capture_x),color='red',size=3)+
  theme_bw()

pomp_obj <- pomp(
  data = SIM[c("time","track_x","capture_x")],
  time="time", t0=1,
  rprocess = discrete.time.sim(ss_step, delta.t=1),
  rmeasure = r_obs,
  dmeasure = d_obs,
  statenames=c("state_x"),
  paramnames=c("sigma_state", "sigma_track","sigma_capture")
)

As for the second question, as you say, one wonders whether the tight constraints on the state at a few points will hurt simulation-based methods like the particle filter. If the error SDs on the two types of data are very different, then one could imagine the particle filter having a very hard time. However, pomp has many other methods besides that one!

Of course, in the very specific case you show here, it is possible to build a Brownian bridge that would enforce the constraint of being at known locations at specific times. The problem of course is that if the model gets a bit more interesting, this is no longer so easy.

MarieAugerMethe commented 6 years ago

@kingaa

Thank you! Based on your example (only very slightly modified), it looks like the particle filter at least works when the standard deviation of the captures is really small, but I get an error when it's set to 0.


library(pomp)

ss_step <- Csnippet('
                    double step_x = 0.0;
                    step_x = rnorm(0, sigma_state);
                    state_x = state_x + step_x;
                    ')

r_obs <- Csnippet('
                  track_x = rnorm(state_x, sigma_track);
                  capture_x = rnorm(state_x, sigma_capture);
                  ')

d_obs <- Csnippet('
                  if (give_log) {
                  lik = dnorm(track_x, state_x, sigma_track, 1) +
                  ((R_FINITE(capture_x)) ? dnorm(capture_x, state_x, sigma_capture, 1) : 0);
                  } else {
                  lik = dnorm(track_x, state_x, sigma_track, 0) *
                  ((R_FINITE(capture_x)) ? dnorm(capture_x, state_x, sigma_capture, 0) : 1);
                  }
                  ')

n_state <- 1000      # Number of observations
s_obs <- 0.1         # SD of the observation process
s_state <- 0.05      # SD of the hidden process
s_capture <- 0.000001      # SD of the capture set to very small values works
# s_capture <- 0      # SD of the capture set to 0 returns error, see below.
state_x_0 <- 0       # Initial state

## Create pomp object
SIM <- simulate(
  pomp(
    data = data.frame(time = 1, track_x = 0, capture_x = 0),
    time="time", t0=1,
    rprocess = discrete.time.sim(ss_step, delta.t=1),
    rmeasure = r_obs,
    dmeasure = d_obs,
    statenames=c("state_x"),
    paramnames=c("sigma_state", "sigma_track","sigma_capture")
  ),
  times=1:n_state,
  params=c(
    sigma_state=s_state,
    sigma_track=s_obs,
    sigma_capture=s_capture,
    state_x.0=state_x_0
  ),
  as.data.frame=TRUE)

## Say that in our ecological data set we know the value of the state_x at time 1
## and end of time series and maybe somewhere in between:
SIM$capture_x <- ifelse(SIM$time %in% c(1,500,1000), SIM$capture_x, NA)

library(ggplot2)
ggplot(data=SIM,aes(x=time))+
  geom_line(aes(y=track_x))+
  geom_point(aes(y=capture_x),color='red',size=3)+
  theme_bw()

pomp_obj <- pomp(
  data = SIM[c("time","track_x","capture_x")],
  time="time", t0=1,
  rprocess = discrete.time.sim(ss_step, delta.t=1),
  rmeasure = r_obs,
  dmeasure = d_obs,
  statenames=c("state_x"),
  paramnames=c("sigma_state", "sigma_track","sigma_capture")
)

# Now let's look at if it works with the particle filter
# Here we use the real parameter values to make things easy
pf <- pfilter(pomp_obj, Np=1000,
              params=c(
                sigma_state=s_state,
                sigma_track=s_obs,
                sigma_capture=s_capture,
                state_x.0=state_x_0
              ),
              filter.mean=TRUE, filter.traj=TRUE)

plot(SIM$state_x, ty="l", col="red", pch=19, cex=0.5)
points(filter.traj(pf), ty="l", col="blue", pch=19, cex=0.5)

The error I get when I use sigma_capture = 0 is:

Error: in ‘pfilter’: ‘dmeasure’ returns non-finite value.
likelihood, data, states, and parameters are:
         time:            1
          lik:          Inf
      track_x:    -0.124159
    capture_x:            0
      state_x:            0
  sigma_state:         0.05
  sigma_track:          0.1
sigma_capture:            0
    state_x.0:            0

I feel like it's not surprising since a normal distribution with sd of 0 is a bit odd and the probability at the mean value (i.e. the known state value) will be infinite, which is throwing a monkey wrench into things.

The small sd values is a good trick, but it would be nice to have it completely fixed. So I'm curious about the brownian bridge option. I did a quick look in the pomp documentation and I don't see any specific mention to it. Is this something already available, or something I would need to code from scratch?

Thanks!

kingaa commented 6 years ago

That's exactly right. With s.d. = 0, the normal distribution is degenerate, hence likelihoods are either infinity or 0. The filter can't handle this. Taking the error distribution s.d. nonzero but small results in a non-degenerate, but highly peaked distribution, which might or might not cause practical problems for the filter. These would show up as a need for an inconveniently large number of particles to avoid filtering failures. It's the curse of too much information!

The Brownian bridge is highly model specific and, as such, not implemented anywhere in pomp. However, it wouldn't be hard to implement it as a part of the model. A simple Brownian bridge, however, really assumes an unrestricted space, so you might still have problems with seals hitchhiking across Newfoundland.

If the naive particle filter kicks up rough because the particles never seem to find the right capture location, you could get clever, tricking pomp into implementing a guided importance sampler. For example, you could introduce a bias into the process model (which is what a Brownian bridge does, of course). You could compensate for this by adjusting the observation model's reported likelihood. See what I mean? I think you still might have to get very clever to deal with coastlines and such, but I haven't thought it all the way through.

kingaa commented 6 years ago

Looking at your code a little more, I noticed an interesting thing. The error you are getting is not there because of failure in the filter, it is there because in your simulation of the data, you use sigma_capture=0. This has the effect of making the initial condition exact, for the capture data, which leads to infinite likelihood. So it is a purely technical problem. In particular, if your initial condition is not observed (as for example in the code below), then you get way with only a small number of filtering failures, even with sigma_capture=0.

n_state <- 1000      # Number of observations
s_obs <- 0.1         # SD of the observation process
s_state <- 0.05      # SD of the hidden process
s_capture <- 0      # SD of the capture set to 0
state_x_0 <- 0       # Initial state

## Create pomp object
SIM <- simulate(
  pomp(
    data = data.frame(time = 1, track_x = 0, capture_x = 0),
    time = "time", t0 = 0,
    rprocess = discrete.time.sim(ss_step, delta.t=1),
    rmeasure = r_obs,
    dmeasure = d_obs,
    statenames=c("state_x"),
    paramnames=c("sigma_state", "sigma_track","sigma_capture")
  ),
  times=1:n_state,
  params=c(
    sigma_state=s_state,
    sigma_track=s_obs,
    sigma_capture=s_capture,
    state_x.0=state_x_0
  ),
  as.data.frame=TRUE)

## Say that in our ecological data set we know the value of the state_x at time 1
## and end of time series and maybe somewhere in between:
SIM$capture_x <- ifelse(SIM$time %in% c(1,500,1000), SIM$capture_x, NA)

library(ggplot2)
ggplot(data=SIM,aes(x=time))+
  geom_line(aes(y=track_x))+
  geom_point(aes(y=capture_x),color='red',size=3)+
  theme_bw()

pomp_obj <- pomp(
  data = SIM[c("time","track_x","capture_x")],
  time="time", t0=1,
  rprocess = discrete.time.sim(ss_step, delta.t=1),
  rmeasure = r_obs,
  dmeasure = d_obs,
  statenames=c("state_x"),
  paramnames=c("sigma_state", "sigma_track","sigma_capture")
)

# Now let's look at if it works with the particle filter
# Here we use the real parameter values to make things easy
pf <- pfilter(pomp_obj, Np=1000,
              params=c(
                sigma_state=s_state,
                sigma_track=s_obs,
                sigma_capture=s_capture,
                state_x.0=state_x_0
              ),
              filter.mean=TRUE)

ggplot(data=as.data.frame(pf),mapping=aes(x=time))+
  geom_line(aes(y=track_x),color='green')+
  geom_point(aes(y=capture_x),color='blue')+
  geom_line(aes(y=state_x),color='red')+
  theme_bw()

Of course, the real test will be whether the filtering failures become a big problem when you are not at the true parameters....

ionides commented 6 years ago

To add to Aaron's suggestions, a situation where the measurement has no error is dealt with in https://ionides.github.io/531w18/14/notes14.html This is written from a slightly different perspective (a measurement with no error and an observation affecting the state dynamics are related model features, since a natural way to coerce the latter into the pomp framework is to include the observation in the state process, in which case we obtain the former). A related issue is addressed in https://academic.oup.com/mbe/article/34/8/2065/3200416 where the supplement develops a partially plug-and-play approach, in the situation where the perfectly measured component of the state space has an available transition density. Perhaps in some situations where this is not exactly true, there may be a reasonable approximation for which it is.

MarieAugerMethe commented 6 years ago

Thank you @kingaa and @ionides!

@kingaa I was surprised that we didn't get a infinite value for the likelihood for the time steps we have the capture data and the sd set to 0, but it's because the estimates from the particle filter are not exactly the same value as that of the capture value/state value.

pf <- pfilter(pomp_obj, Np=1000,
              params=c(
                sigma_state=s_state,
                sigma_track=s_obs,
                sigma_capture=s_capture,
                state_x.0=state_x_0
              ),
              filter.mean=TRUE, save.states=TRUE)

# Is the mean the same?
cbind(SIM$state_x[c(1,500,1000)], SIM$capture_x[c(1,500,1000)], filter.mean(pf)[c(1,500,1000)])

# Is any of the particles the same?
pp <- matrix(unlist(pf$saved.states), ncol=1000)
any(pp[1,] == SIM$state_x[1])
any(pp[500,] == SIM$state_x[500])
any(pp[1000,] == SIM$state_x[1000])

I feel it means that the contribution of the capture information is null since when the sd is set to 0 and the values are not exactly the same dnorm is going to return 0. I think this model is then effectively equivalent to a model without the capture information, no? From a very quick and dirty simulation test, it looks like we are better of with a really small sd rather than an sd = 0.

Sorry I misunderstood the Brownian Bridge reference. I thought you were thinking of using a brownian bridge in the particle filter itself not in the process equation. Something where there is a bias in the sampler. For now, I would rather stick with this model, because the real model is more complicated and incorporating the brownian bridge might make it a bit unwieldy.

On a different note, I don't understand why having exact initial condition is a problem. Does it have to do with the initializer? In movement ecology, knowing the initial state is easier than knowing the other states.

@ionides Thank you for the interesting example. I'm not sure I understand all of the details, but I can definitely see how to change the r_process, but I'm not sure I understand how to change the d_measure. If I understand well, because there is no measurement error, you write the process stochasiticy in the measurement. Can you give me an additional hint?

Thanks!

Marie

kingaa commented 6 years ago

@MarieAugerMethe

You write

I was surprised that we didn't get a infinite value for the likelihood for the time steps we have the capture data and the sd set to 0, but it's because the estimates from the particle filter are not exactly the same value as that of the capture value/state value.

Actually, the key change that I made to your model was to set the initial time (t0, the time at which the initial conditions obtain) to 0. It had been set at 1 in your original model. Therefore, under the stochastic state process, there is diversity in the underlying state at the time (t = 1) of the first observation. As you correctly observed, with capture_sigma = 0, all of the simulations are therefore incompatible with the capture data (unless of course, the probability-zero event happens that the state variables perfectly agree with the capture data). This avoids the error that you had been seeing, which was due not to the incompatibility, but to the perfect agreement (likelihood = infinity) at t = 1, enforced by your initial conditions.

You are correct that with capture_sigma = 0, you might as well not have the capture data at all, since they contribute no information. In effect, these data refuse to teach the particles nothing since none of the particles can meet their standard of perfection. In fact, it is worse than that, because the presence of a capture observation at any time obliterates the information in the tracking data stream.

For this reason, it might be interesting to actually inflate the capture-error s.d. At least at that point, it would contribute something. Alternatively, you could replace the tracking data at those special time-points with the capture data, specifying that the s.d. should be smaller than it is for the tracking data, but not zero.

MarieAugerMethe commented 6 years ago

Thank you @kingaa !

Ok, I think the way forward is the slightly inflated sd. We want the ~~tracking~~ capture data at that point, because, in theory, it should help estimate the magnitude of the tracking measurement error. Then we don't have any problem with knowing the initial state at the time of the first observation. I'm still slightly confused about the initial condition situation, because I thought the default was a deterministic state. Just to make sure I understand, when you input an initial state value, e.g. state_x.0 = 0 in the params argument, it sets that the state value to that value, correct? If you say that t0=0, then you get different values for the state at time t1 (the time of your observation) and the values are sampled base on your process equation, if you set t0=1, then you get the fix value at the time of the first observation.

By the way, just for posterity, I think the way to model it with the small sd, is to use a normal distribution with a small sd only for the dmeasure function, but set it equal in the rmeasure, as follows:

library(pomp)

ss_step <- Csnippet('
                    double step_x = 0.0;
                    step_x = rnorm(0, sigma_state);
                    state_x = state_x + step_x;
                    ')

r_obs <- Csnippet('
                  track_x = rnorm(state_x, sigma_track);
                  capture_x = state_x;
                  ')

d_obs <- Csnippet('
                  if (give_log) {
                  lik = dnorm(track_x, state_x, sigma_track, 1) +
                  ((R_FINITE(capture_x)) ? dnorm(capture_x, state_x, sigma_capture, 1) : 0);
                  } else {
                  lik = dnorm(track_x, state_x, sigma_track, 0) *
                  ((R_FINITE(capture_x)) ? dnorm(capture_x, state_x, sigma_capture, 0) : 1);
                  }
                  ')

n_state <- 1000      # Number of observations
s_obs <- 0.1         # SD of the observation process
s_state <- 0.05      # SD of the hidden process
s_capture <- 0.0001      # SD of the capture set to 0
state_x_0 <- 0       # Initial state

## Create pomp object
SIM <- simulate(
  pomp(
    data = data.frame(time = 1, track_x = 0, capture_x = 0),
    time = "time", t0 = 1,
    rprocess = discrete.time.sim(ss_step, delta.t=1),
    rmeasure = r_obs,
    dmeasure = d_obs,
    statenames=c("state_x"),
    paramnames=c("sigma_state", "sigma_track","sigma_capture")
  ),
  times=1:n_state,
  params=c(
    sigma_state=s_state,
    sigma_track=s_obs,
    sigma_capture=s_capture,
    state_x.0=state_x_0
  ),
  as.data.frame=TRUE)

## Say that in our ecological data set we know the value of the state_x at time 1
## and end of time series and maybe somewhere in between:
SIM$capture_x <- ifelse(SIM$time %in% c(1,500,1000), SIM$capture_x, NA)

library(ggplot2)
ggplot(data=SIM,aes(x=time))+
  geom_line(aes(y=track_x))+
  geom_point(aes(y=capture_x),color='red',size=3)+
  theme_bw()

pomp_obj <- pomp(
  data = SIM[c("time","track_x","capture_x")],
  time="time", t0=1,
  rprocess = discrete.time.sim(ss_step, delta.t=1),
  rmeasure = r_obs,
  dmeasure = d_obs,
  statenames=c("state_x"),
  paramnames=c("sigma_state", "sigma_track","sigma_capture")
)

# Now let's look at if it works with the particle filter
# Here we use the real parameter values to make things easy
pf <- pfilter(pomp_obj, Np=1000,
              params=c(
                sigma_state=s_state,
                sigma_track=s_obs,
                sigma_capture=s_capture,
                state_x.0=state_x_0
              ),
              filter.mean=TRUE, save.states=TRUE)

cbind("Sim_state"= SIM$state_x[c(1,500,1000)], 
      "Sim_capture" = SIM$capture_x[c(1,500,1000)], 
      "Predicted_state" =filter.mean(pf)[c(1,500,1000)])

Thanks!!

Marie

kingaa commented 6 years ago

@MarieAugerMethe, your understanding (quoted below) is correct. The default initializer is deterministic. You also correctly summarize the reason you get an error with t0=1 and no error with t0=0.

I'm still slightly confused about the initial condition situation, because I thought the default was a deterministic state. Just to make sure I understand, when you input an initial state value, e.g. state_x.0 = 0 in the params argument, it sets that the state value to that value, correct? If you say that t0=0, then you get different values for the state at time t1 (the time of your observation) and the values are sampled base on your process equation, if you set t0=1, then you get the fix value at the time of the first observation.

As for your plan with rmeasure and dmeasure, obviously this is possible. Just bear in mind that, since rmeasure and dmeasure will now be incompatible (i.e., they correspond to different models), so things like model checking via simulations will not be possible. However, in your case I don't think you'd be interested in doing that anyway, so as long as you don't forget that you've made these incompatible, you should have no problems.

kingaa commented 6 years ago

I'm closing this issue now, but feel free to open it back up for more discussion if you like. Cheers!