kingaa / pomp

R package for statistical inference using partially observed Markov processes
https://kingaa.github.io/pomp
GNU General Public License v3.0
111 stars 27 forks source link

Measurement model for multi-observed variables/states #157

Closed mehrdadfazli closed 3 years ago

mehrdadfazli commented 3 years ago

I wanted to ask for your suggestion as to what is the most effective way of defining a measurement model for more than one observable. Currently, what I am doing is to add the likelihoods like lik = lik_X + lik_Y. However, this gives rise to an issue which is that the scales of X and Y are very much different. So, the likelihood and consequently the parameter estimation would be dominated by one of the observables. I was wondering what is the best course of action here (take the log from the large variable, give weight to the likelihoods while adding them up, scaling variables to the same order, ...).

My second question is about the random walk standard deviation rw.sd. I figured (with trial and error) that mif fails unless the rw.sd for some parameters is 1e-3 or even less. I was just wondering if this is ok. Could it be that the model is so sensitive to some parameters?

kingaa commented 3 years ago

@mehrdadfazli: these are good questions.

With respect to your first question: The relative weight that is given to each of the observables is determined by the breadth (e.g., the variance) of the measurement distribution. You have unlimited ability to adjust this by adjusting your measurement model. In particular, if the measurement model is scaled appropriately, the fact that the observables have disparate scales is of no consequence.

(By the way, I suppose you mean you are adding the log likelihoods and multiplying the likelihoods? Are you avoiding the common mistake mentioned in the "Important Note" here?)

With respect to your second question: Have you seen this FAQ and this discussion about the choice of random-walk s.d.s? As you will see there, the question is about the scale of consequential variation of the parameters. Without knowing more about (1) how your parameters affect the dynamics of the model and (2) the how informative the data are about those dynamics, it is not possible to say more. I would be happy to try, but you will need to say more about your model, the data, etc.

mehrdadfazli commented 3 years ago

Thank you for your clarifying remarks. I now see how variance plays a role in governing the likelihood. However, do you suggest fixing the variance of the variables? As the model might not be able to strike a balance between the two and favors one variable more in estimation (after all its goal is to maximize the log lik). Following is my code (and the link to the data). I appreciate your insightful comments!

library(readr)
library(foreach)
library(doParallel)
numCores <- commandArgs(trailingOnly=TRUE)[1]
numCores <- as.numeric(numCores) - 1
registerDoParallel(cores=numCores)
print(paste('number of cores is ',numCores, ' for YV'))

library(doRNG)
registerDoRNG(625904618)
library(tidyverse)
library(pomp)
options(stringsAsFactors=FALSE)
stopifnot(packageVersion("pomp")>="3.0")

run_level <- 2
covid_Np <-          switch(run_level,100, 5e3, 5e4)
covid_Nmif <-        switch(run_level, 10, 100, 100)
covid_Nreps_eval <-  switch(run_level,  2,  10,  10)
covid_Nreps_local <- switch(run_level, 10,  10,  10)
covid_Nreps_global <-switch(run_level, 10,  20,  20)
covid_Nsim <-        switch(run_level, 50, 100, 100) 

set.seed(1350254336)
setwd("SET to Home directory")

cache_address = "SET to cach address"

#########################################################################
#--------------------------|  rproc  |----------------------------------#
#########################################################################
rproc <- Csnippet("
                  double beta, foi, dw, births, mu_SE;

                  //we only consider those that participate in the epidemic:
                  double pop = S + E + I + R;

                  // transmission rate
                  beta = b0;

                  // expected force of infection. iota: imported infections
                  // alpha mixing parameter, = 1:homogeneous mixing
                  foi = beta*pow(I+iota, alpha)/pop;

                  // white noise (extrademographic stochasticity)
                  dw = rgammawn(sigmaSE,dt);

                  mu_SE = foi*dw/dt;  // stochastic force of infection

                  // Poisson births: fraction of leak into S from N
                  births = rpois(br*dt);

                  // State Updates:
                  double dN_SE  = rbinom(S , 1-exp(-mu_SE  *dt));
                  double dN_EI  = rbinom(E , 1-exp(-mu_EI  *dt));
                  double dN_IR  = rbinom(I , 1-exp(-mu_IR *dt));
                  S += births - dN_SE;
                  E += dN_SE  - dN_EI;
                  I += dN_EI  - dN_IR;
                  R += dN_IR;
                  W += (dw - dt)/sigmaSE;  // standardized i.i.d. white noise
                  ")

#########################################################################
#--------------------------|  rinit  |----------------------------------#
#########################################################################
rinit <- Csnippet("
                  double m = eta*N;
                  S = nearbyint(m*S_0);
                  E = nearbyint(m*E_0);
                  I = nearbyint(m*I_0);
                  R = nearbyint(m*R_0);
                  W = 0;
                  ")

#########################################################################
#--------------------------|  dmeas  |----------------------------------#
#########################################################################
dmeas <- Csnippet("
                  // Model for Viral Load
                  double shed_cases = E + I;
                  double mu_V = rho_V*shed_cases;
                  double std_V = sqrt(mu_V*(1+od_V));
                  double lik_V = dnorm(V, mu_V, std_V, 1);

                  // Model for Case Counts
                  double mu_Y = rho_Y*I;
                  double std_Y = sqrt(mu_Y*(1+od_Y));
                  double lik_Y;

                  if (Y > 0.0) {
                    lik_Y = pnorm(Y+0.5,mu_Y,std_Y,1,1)
                           - pnorm(Y-0.5,mu_Y,std_Y,1,1);
                  } else {
                    lik_Y = pnorm(Y+0.5,mu_Y,std_Y,1,1);
                  }

                  // Combined likelihood
                  lik = lik_V + lik_Y;
                  //lik = lik_V;
                  //lik = lik_Y;
                  lik = (give_log) ? lik : exp(lik);

                  ")

#########################################################################
#--------------------------|  rmeas  |----------------------------------#
#########################################################################
rmeas <- Csnippet("
                  // Viral Load
                  double shed_cases = E + I;
                  double mu_V = rho_V*shed_cases;
                  double std_V = sqrt(mu_V*(1+od_V));
                  V = rnorm(mu_V, std_V);

                  // Case Counts
                  double mu_Y = rho_Y*I;
                  double std_Y = sqrt(mu_Y*(1+od_Y));
                  Y = rnorm(mu_Y, std_Y);
                  if (Y > 0.0) {
                    Y = nearbyint(Y);
                  } else {
                    Y = 0.0;
                  }

                  ")

#########################################################################
#-------------------------|  Load Data  |-------------------------------#
#########################################################################
data = read_csv("https://raw.githubusercontent.com/mehrdadfazli/COVID-SEIR/main/Data/abm.csv")

#########################################################################
#-------------------------|  Parameters  |------------------------------#
#########################################################################
parameters = c(
  "b0", "alpha", "iota",      
  "sigmaSE",                  
  "br",                       
  "mu_EI", "mu_IR",           
  "N",                        
  "eta",                      
  "rho_V", "od_V",            
  "rho_Y", "od_Y",            
  "S_0","E_0","I_0", "R_0")

par_trans = parameter_trans(
  log = c(
    "b0", "alpha", "iota","sigmaSE", "br",
    "rho_V", "od_V", "od_Y"),
  logit = c("mu_EI", "mu_IR","eta", "rho_Y"),
  barycentric=c("S_0","E_0","I_0", "R_0")
)
states = c("S", "E", "I", "R", "W")

#########################################################################
#-------------------------|  Covariates  |------------------------------#
#########################################################################

sdm_covar <- covariate_table(
  t=      data[["day"]],
  sdmm=   data[["sdm"]],
  event=  data[["events"]],
  order=  "constant",
  times=  "t"
)

# shifting case counts by the assumed reporting delay
rep_del = 5
data %>% mutate_at(c("Y_1"), 
                       tibble::lst("Y_1"=lead), 
                       n=rep_del) %>%
  mutate_at(c("Y_2"), 
            tibble::lst("Y_2"=lead), 
            n=rep_del)%>%
  mutate_at(c("Y_3"), 
            tibble::lst("Y_3"=lead), 
            n=rep_del)%>%
  mutate_at(c("Y"), 
            tibble::lst("Y"=lead), 
            n=rep_del)-> data_c

# focusing on the first peak for now
data <- data_c[1:100,]

#########################################################################
#-------------------------|  pomp Model  |------------------------------#
#########################################################################

covidSEIRsR = data %>%
  select(-logV) %>%
  rename(V = V,
         Y = Y
  ) %>%
  pomp(
    times = "day", # column name of data that corresponds to time
    t0 = 0,        # starting time
    # rprocess = discrete_time(rproc, delta.t=1), # daily
    rprocess = euler(rproc, delta.t=1/6), # every four
    rinit = rinit,
    rmeasure = rmeas,
    dmeasure = dmeas,
    accumvars= c("W"),
    partrans = par_trans,
    statenames = states,
    paramnames = parameters,
    covar=sdm_covar
  )

#########################################################################
#-------------------------|  Simulations  |-----------------------------#
#########################################################################

params_guess = c(
  b0=0.013, alpha=1.62, iota=5, 
  sigmaSE=0.8,
  br=2,
  mu_EI=.16, mu_IR=0.13, # state transition
  rho_V=150, od_V=100,                    # measurement V
  rho_Y=.14, od_Y=5,                   # measurement Y
  eta=.05, N=50000,                   # initial value parameters
  S_0=.95, E_0=.04, I_0=.01, R_0=.0)

y = covidSEIRsR %>%
  simulate(params=params_guess, nsim=50, format="data.frame")

y_avg = y %>% group_by(day) %>% summarize_at(vars(S:R, V, Y), mean)

observed = data %>%
  mutate(actual.cases = Y / params_guess['rho_Y']) %>%
  select(day, V = V, Y = actual.cases) %>%
  pivot_longer(c(V, Y))

y %>% pivot_longer(c(V, Y)) %>%
  ggplot(aes(x = day, y = value)) +
  geom_line(aes(color = factor(.id))) +
  geom_line(data = y_avg %>% pivot_longer(c(V, Y)),
            size=2, color="blue") +
  geom_line(data = observed, color="black", size=2) +
  scale_color_brewer(type = 'qual', palette = 3) +
  guides(color = FALSE) +
  facet_wrap(~name, scales="free_y")

#########################################################################
#----------------------|  Particle Filtering  |-------------------------#
#########################################################################

tic <- Sys.time()

L_pf=0
pf=0
foreach(i=1:10,.combine=c) %dopar% {
  library(pomp)
  covidSEIRsR %>% pfilter(params=params_guess,Np=100)
} -> pf

pf %>% logLik() %>% logmeanexp(se=TRUE) -> L_pf
L_pf
toc <- Sys.time()

#########################################################################
#-----------------------------| local |---------------------------------#
#########################################################################

registerDoRNG(482947940)
bake(file=paste(cache_address,"/local_search_simple.rds", sep = ''),{
  foreach(i=1:covid_Nreps_local,.combine=c, .errorhandling="remove") %dopar% {
    library(pomp)
    library(tidyverse)
    covidSEIRsR %>%
      mif2(
        params=params_guess,
        Np=covid_Np, Nmif=covid_Nmif,
        cooling.fraction.50=0.5,
        rw.sd=rw.sd(b0=0.01, alpha=0.01, iota=0.01,
                    sigmaSE=0.01, br=0.001, mu_EI=0.00, mu_IR=0.00, 
                    S_0=ivp(0.00), E_0=ivp(0.00), I_0=ivp(0.00), R_0=ivp(0.00), 
                    eta=ivp(0.00), rho_V=0.01, rho_Y=0.0, od_V=0.01, od_Y=0.01)
      ) %>%
      mif2(cooling.fraction.50=0.3) %>%
      mif2(cooling.fraction.50=0.1)
  } -> mifs_local
  attr(mifs_local,"ncpu") <- getDoParWorkers()
  mifs_local
}) -> mifs_local
t_loc <- attr(mifs_local,"system.time")
ncpu_loc <- attr(mifs_local,"ncpu")

#plotting the parallel tasks for the local search}
mifs_local %>%
  traces() %>%
  melt() %>%
  ggplot(aes(x=iteration, y=value, group=L1, color=factor(L1)))+
  geom_line()+
  guides(color=FALSE)+
  facet_wrap(~variable,scales="free_y")

#parallel runs for calculating exact likelihood}
registerDoRNG(900242057)
bake(file=paste(cache_address,"/lik_local_simple.rds", sep = ''),{
  foreach(mf=mifs_local,.combine=rbind, .errorhandling="remove") %dopar% {
    library(pomp)
    library(tidyverse)
    evals <- replicate(covid_Nreps_local, logLik(pfilter(mf,Np=covid_Np)))
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> results
  attr(results,"ncpu") <- getDoParWorkers()
  results
}) -> results
t_local <- attr(results,"system.time")
ncpu_local <- attr(results,"ncpu")
# pairs(~loglik+b0+alpha+iota+br+rho_V+od_V,data=results,pch=16)

#########################################################################
#---------------------------| Global |----------------------------------#
#########################################################################

set.seed(532718320)
#initial guesses, creating parameter ranges
runif_design(
  lower=c(b0=0.0001,  od_Y=1,       rho_V=20,  od_V=0,
          alpha=0.5,  iota=1.0, sigmaSE=0.1, br=1.0),
  upper=c(b0=1.0000,  od_Y=100., rho_V=200, od_V=100,
          alpha=1.5,  iota=20., sigmaSE=1.5, br=20.),
  nseq=50
) -> guesses

mf1 <- mifs_local[[1]]

fixed_params <- c(N=50000, mu_EI=.16, mu_IR=0.13, rho_Y=0.14, S_0=.95, E_0=.04, I_0=.01, R_0=.0, eta=0.05)

#run global search}
bake(file=paste(cache_address,"/global_search_simple.rds", sep = ''),{
  registerDoRNG(1270401374)
  foreach(guess=iter(guesses,"row"), .combine=rbind, .errorhandling="remove") %dopar% {
    library(pomp)
    library(tidyverse)
    mf1 %>%
      mif2(params=c(unlist(guess),fixed_params),
           Np=covid_Np, Nmif=covid_Nmif,
           cooling.fraction.50=0.5,
           rw.sd=rw.sd(b0=0.01, alpha=0.01, iota=0.01,
                       sigmaSE=0.001, br=0.001, mu_EI=0.00, mu_IR=0.00, 
                       S_0=ivp(0.00), E_0=ivp(0.00), I_0=ivp(0.00), R_0=ivp(0.00), 
                       eta=ivp(0.00), rho_V=0.01, rho_Y=0.0, od_V=0.0, od_Y=0.01)) %>%
      mif2(cooling.fraction.50=0.3) %>%
      mif2(cooling.fraction.50=0.1) -> mf
    replicate(
      covid_Nreps_global,
      mf %>% pfilter(Np=covid_Np) %>% logLik()
    ) %>%
      logmeanexp(se=TRUE) -> ll
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> results
  attr(results,"ncpu") <- getDoParWorkers()
  results
}) %>%
  filter(is.finite(loglik)) -> results
t_global <- attr(results,"system.time")
t_gncpu_global <- attr(results,"ncpu")

write.table(results, "./results/searched_params_simple.csv", sep = ",",
            col.names = names(results),
            append = T)

# write.table(results, "./results/covid_params_simple.csv", sep = ",",
#             col.names = !file.exists("./results/covid_params_simple.csv"),
#             append = T)
kingaa commented 3 years ago

I now see how variance plays a role in governing the likelihood. However, do you suggest fixing the variance of the variables? As the model might not be able to strike a balance between the two and favors one variable more in estimation (after all its goal is to maximize the log lik).

The question of how to weight the variables is no different from any other question about the model. If your measurement model has adjustable parameters that control the variance of the error distribution, then one can estimate these parameters, by maximum likelihood for example, just as one would the other parameters.

It is worth thinking about where in the data the information about these parameters might lie....

kingaa commented 3 years ago

I notice that, in your measurement model, you are assuming that the variance of the measurement error scales linearly with the mean. What is your rationale?

mehrdadfazli commented 3 years ago

Well, it was a first guess trying to make sure we have enough variability in the model. But now that I am thinking maybe it was not a good idea (for V at least) and maybe that is why V has a dominant log lik. I will try to set the variance to a single parameter.

Thanks a lot

kingaa commented 3 years ago

That might be good. It is very often the case, however, that the variance scales with the mean, either linearly (as you have assumed) or quadratically.

In order to maximize the likelihood by iterated filtering, it is first necessary to be able to filter. If you ensure that your model has enough measurement error that it is able to filter, then the mif2 algorithm can always reduce the variability if it increases the likelihood by doing so.

mehrdadfazli commented 3 years ago

Something I experienced with my model (not exactly the one posted) was that filtering was working fine and gave me a reasonable likelihood. But when I applied iterated filtering it would fail (with Inf or NaN value for lik and/or some parameters). Sometimes, reducing the rw.sd could solve the issue. That is why I ended up with very small values of rw.sd for some parameters.

kingaa commented 3 years ago

Are you transforming the parameters when applying mif2? If you don't, then when mif2 proposes, for example, negative transmission rates, the model produces nonsense, no?

mehrdadfazli commented 3 years ago

I do transform parameters to avoid negative values. But, since there is no upper bound sometimes some of the parameters explode which results in an illegal value for the likelihood.

par_trans = parameter_trans(
  log = c(
    "b0", "alpha", "iota","sigmaSE", "br",
    "rho_V", "od_V", "od_Y"),
  logit = c("mu_EI", "mu_IR","eta", "rho_Y"),
  barycentric=c("S_0","E_0","I_0", "R_0")
)
kingaa commented 3 years ago

Interesting. Which parameters explode?

mehrdadfazli commented 3 years ago

It was not for this specific model. I will open another ticket if that happens again. Thank you so much for your prompt responses.

kingaa commented 3 years ago

I will close this issue, then. Feel free to reopen if more discussion would be useful.