kingaa / pomp

R package for statistical inference using partially observed Markov processes
https://kingaa.github.io/pomp
GNU General Public License v3.0

encounter problems when conducting trajectory fitting #166

Closed stevenhuyi closed 2 years ago

stevenhuyi commented 2 years ago

I am trying to use an SEIR model in pomp to fit my observed series of HFMD data. I encounter an error when constructing the skeleton argument; my code is as follows:

###############construct pomp model###########
rmeas <- "
  cases = rnbinom_mu(1/theta, rho * H);
"
dmeas <- "
  lik = dnbinom_mu(cases, 1/theta, rho * H, give_log);
"
seas.seir.determin <- "
  double schl;
  double seas;
  double Beta;
  double rate[8];
  double dN[8];

  // school terms
  t = (t-floor(t))*365.25;
  if ((t>=4&&t<=31) || (t>=60&&t<=180) || (t>=243&&t<=272) || (t>=280&&t<=356.25))
      schl = 1.0+amplitude*0.2987/0.7013;  //the fraction of year children in school is 0.7013 
    else
      schl = 1.0-amplitude;
  seas = exp(a*PC1 + b*PC2); // climatic effect
  Beta = seas*schl;

  rate[0] = births;                // birth
  rate[1] = Beta*pow(I + iota,alpha) / pop; // infection
  rate[2] = mu;                    // death from S
  rate[3] = delta;                 // advance from E to I
  rate[4] = mu;                    // death from E
  rate[5] = gamma;                 // recovery from I to R
  rate[6] = mu;                    // death from I
  rate[7] = mu;                    // death from R

  dN[0] = rate[0]*dt;
  dN[1] = rate[1]*S;
  dN[2] = rate[2]*S;
  dN[3] = rate[3]*E;
  dN[4] = rate[4]*E;
  dN[5] = rate[5]*I;
  dN[6] = rate[6]*I;
  dN[7] = rate[7]*R;

  DS = dN[0] - dN[1] - dN[2];
  DE = dN[1] - dN[3] - dN[4];
  DI = dN[3] - dN[5] - dN[6];
  DR = dN[5] - dN[7];
  DH = dN[3];
"

ind_case %>% 
  pomp(t0=2009,
       time="time",
       rinit =Csnippet("
                        double m = pop/(S_0+E_0+I_0+R_0);
                        S = nearbyint(m*S_0);
                        E = nearbyint(m*E_0);
                        I = nearbyint(m*I_0);
                        R = nearbyint(m*R_0);
                        H = 0;
                        "),
       skeleton = pomp::map(f = Csnippet(seas.seir.determin),delta.t=1/52/20),
       dmeasure=Csnippet(dmeas),
       rmeasure=Csnippet(rmeas),
       covar=covariate_table(ind_covar, times = "time"),
       accumvars="H",
       statenames=c("S","E","I","R","H"),
       paramnames=c("mu","delta","gamma","alpha","iota","theta",
                    "rho","amplitude","a","b","S_0","E_0","I_0","R_0")
  ) -> m1

here comes the error:

error: invalid operands to binary * (have 'double' and 'double (*)(double,  double,  int)')
   dN[0] = rate[0]*dt;
           ~~~~~~~^

Since births in the covariate data frame is the annual accumulated number of births, births * dt should be the number of individuals entering the susceptible class at each dt step. What type is the dt variable? It seems that dt cannot be used in a binary calculation. I also changed the model to a stochastic process using the following code, and that works. Would you please tell me why? Thanks in advance.

seas.seir.stoch <- "
  double schl;
  double seas;
  double Beta;
  double rate[8];
  double dN[8];

  // term-time seasonality
  t = (t-floor(t))*365.25;
  if ((t>=4&&t<=31) || (t>=60&&t<=180) || (t>=243&&t<=272) || (t>=280&&t<=356))
      schl = 1.0+amplitude*0.2987/0.7013;  //the fraction of year children in school is 0.7013 
    else
      schl = 1.0-amplitude;

  seas = exp(a*PC1 + b*PC2) ;
  Beta = seas*schl;

  rate[0] = births;                // birth
  rate[1] = Beta*pow(I + iota,alpha) / pop; // infection
  rate[2] = mu;                    // death from S
  rate[3] = delta;                 // advance from E to I
  rate[4] = mu;                    // death from E
  rate[5] = gamma;                 // recovery from I to R
  rate[6] = mu;                    // death from I
  rate[7] = mu;                    // death from R

  dN[0] = rpois(rate[0] * dt);
  reulermultinom(2, S, &rate[1], dt, &dN[1]);
  reulermultinom(2, E, &rate[3], dt, &dN[3]);
  reulermultinom(2, I, &rate[5], dt, &dN[5]);

  S += dN[0] - dN[1] - dN[2];
  E += dN[1] - dN[3] - dN[4];
  I += dN[3] - dN[5] - dN[6];
  R = pop - S - E - I;
  H += dN[3];
"
ind_case %>% 
  pomp(t0=2009,
       time="time",
       rprocess=euler(step.fun = Csnippet(seas.seir.stoch),delta.t=1/52/20),
       rinit =Csnippet("
                        double m = pop/(S_0+E_0+I_0+R_0);
                        S = nearbyint(m*S_0);
                        E = nearbyint(m*E_0);
                        I = nearbyint(m*I_0);
                        R = nearbyint(m*R_0);
                        H = 0;
                        "),
       dmeasure=Csnippet(dmeas),
       rmeasure=Csnippet(rmeas),
       covar=covariate_table(ind_covar, times = "time"),
         ##accumvars="H",
       statenames=c("S","E","I","R","H"),
       paramnames=c("mu","delta","gamma","alpha","iota","theta",
                    "rho","amplitude","a","b","S_0","E_0","I_0","R_0")
  ) -> m2

ind_case.csv ind_covar.csv

kingaa commented 2 years ago

@stevenhuyi: thanks for the detailed report.

The error message arises because the dt variable you reference is not defined in the context in which the C snippet is executed, except as a function (the probability density function of the Student t distribution, which is definitely not what you had in mind). This explains the error: C doesn't know how to multiply (binary *) a number ('double') by a function ('double (*)(double, double, int)'). The solution is to define dt inside your C snippet. I can understand that you assumed it would be defined there, as it is in an rprocess snippet, but it is not. Call it a bug, call it a feature, but for a skeleton map snippet, if you want a timestep, you have to define it yourself inside the snippet as well as in the call to map(). Looking over the documentation, I see that, while dt is not mentioned as being available to the C snippet, neither is there anything saying that it isn't. I think I'll take a moment and add a note to that effect.
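For concreteness, that can be as simple as a declaration at the top of the skeleton snippet, e.g.

  double dt = 1.0/52.0/20.0;   // chosen by you; keep it in sync with the delta.t passed to map()

which shadows the R API's dt() function, after which an expression like rate[0]*dt compiles as ordinary arithmetic. (The 1/52/20 here simply echoes the delta.t in your map() call; it is an illustration, not a value pomp supplies.)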

Looking at your skeleton C snippet, I notice that it is probably not going to behave the way you intend. You seem to want to model it as a map, but many of the equations read like the specification of a vectorfield. In particular, in the following lines it does seem like the RHS should be proportional to the timestep.

  dN[1] = rate[1]*S;
  dN[2] = rate[2]*S;
  dN[3] = rate[3]*E;
  dN[4] = rate[4]*E;
  dN[5] = rate[5]*I;
  dN[6] = rate[6]*I;
  dN[7] = rate[7]*R;

  DS = dN[0] - dN[1] - dN[2];
  DE = dN[1] - dN[3] - dN[4];
  DI = dN[3] - dN[5] - dN[6];
  DR = dN[5] - dN[7];
  DH = dN[3];

Perhaps you can explain your rationale and I can offer advice?

stevenhuyi commented 2 years ago

Thanks for your instant reply, @kingaa. I will reply to your second question first. As my equations show, I should specify a vectorfield in the skeleton. Actually, at the beginning I didn't know exactly when to specify a vectorfield and when to specify a map, although I was sure my equations are differential equations. Now I think the skeleton should be specified as a vectorfield. Would you please also explain a bit more about when to use a map, given that the documentation says it is for a discrete-time process?

If a vectorfield is chosen, how should dt be defined? You said that dt should be defined as a function in the C snippet. All I intended was to supplement the susceptible class with the newborns at each (systematic) step; hence I used a births * dt term. However, if a vectorfield is used, there is no way to define the time step. One point that puzzles me is why I do not have to define dt in the C snippet for the stochastic state process in m2 above. Is it because it is defined by euler(step.fun = Csnippet(seas.seir.stoch), delta.t=1/52/20)?

kingaa commented 2 years ago

This is one thing that is not complicated! The generator of a continuous-time dynamical system is a vectorfield; a map generates a discrete-time dynamical system.

When you specify the skeleton as a vectorfield and ask for a trajectory, pomp calls an ODE integrator from the deSolve package. This routine decides internally how to translate your specification of the rate of various events into the trajectory they imply. I suggest you read up on the documentation of the deSolve package.
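In code, the difference lies only in how you wrap the skeleton argument; schematically (skel here is a placeholder for your Csnippet, not drop-in code):

  skeleton = pomp::map(Csnippet(skel), delta.t=1/52)   # discrete time: D<var> is the state after one step of length delta.t
  skeleton = vectorfield(Csnippet(skel))               # continuous time: D<var> is the instantaneous rate of change; no step size

Since your equations are differential equations, the second form is the one you want.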

You said that dt should be defined as a function in the C snippet.

You misunderstood me. I did not say that dt should be defined as a function, only that it is defined as a totally unrelated function in the R C API, which is visible to every C snippet. If you specify the skeleton as a map, and want to refer to the step size, you must define it yourself (thereby overriding the reference to the p.d.f. of the Student t, which is not what you want). If your skeleton is a vectorfield, there is no stepsize and no reason for one.

In an rprocess snippet, dt is defined. You cannot control it (it is set by pomp, and can change), so its current value is passed to the C snippet when the snippet is executed.

stevenhuyi commented 2 years ago

In an rprocess snippet, dt is defined.

I am wondering where dt is defined. Is it defined automatically within rprocess?

You cannot control it (it is set by pomp, and can change),

If so, what's the point of specifying delta.t in euler(step.fun, delta.t)?

Finally, I changed my code as follows to estimate the parameters; however, some problems turned up. I wonder whether my subjective starting values for the parameters are causing the problem. Would you help me out? Thanks very much.

rmeas <- "
  cases = rnbinom_mu(1/theta, rho * H);
"
dmeas <- "
  lik = dnbinom_mu(cases, 1/theta, rho * H, give_log);
"
seas.seir.determin <- "
  double schl;
  double seas;
  double Beta;
  double rate[8];
  double dN[8];

  // term-time seasonality
  t = (t-floor(t))*365.25;
  if ((t>=4&&t<=31) || (t>=60&&t<=180) || (t>=243&&t<=272) || (t>=280&&t<=356.25))
      schl = 1.0+amplitude*0.2987/0.7013;  //the fraction of year children in school is 0.7013
    else
      schl = 1.0-amplitude;
  seas = exp(a*PC1 + b*PC2);
  Beta = seas*schl;

  rate[0] = births; // birth
  rate[1] = Beta*pow(I + iota,alpha) / pop; // infection
  rate[2] = mu; // death from S
  rate[3] = delta; // advance from E to I
  rate[4] = mu; // death from E
  rate[5] = gamma; // recovery from I to R
  rate[6] = mu; // death from I
  rate[7] = mu; // death from R

  dN[0] = rate[0];
  dN[1] = rate[1]*S;
  dN[2] = rate[2]*S;
  dN[3] = rate[3]*E;
  dN[4] = rate[4]*E;
  dN[5] = rate[5]*I;
  dN[6] = rate[6]*I;
  dN[7] = rate[7]*R;

  DS = dN[0] - dN[1] - dN[2];
  DE = dN[1] - dN[3] - dN[4];
  DI = dN[3] - dN[5] - dN[6];
  DR = dN[5] - dN[7];
  DH = dN[3];
"
ind_case %>% 
  pomp(t0=2009,
       time="time",
       rinit =Csnippet("
                        double m = pop/(S_0+E_0+I_0+R_0);
                        S = nearbyint(m*S_0);
                        E = nearbyint(m*E_0);
                        I = nearbyint(m*I_0);
                        R = nearbyint(m*R_0);
                        H = 0;
                        "),
       skeleton = vectorfield(Csnippet(seas.seir.determin)),
       dmeasure=Csnippet(dmeas),
       rmeasure=Csnippet(rmeas),
       covar=covariate_table(ind_covar, times = "time"),
       accumvars="H",
       statenames=c("S","E","I","R","H"),
       paramnames=c("mu","delta","gamma","alpha","iota","theta",
                    "rho","amplitude","a","b","S_0","E_0","I_0","R_0")
  ) -> m1

pars = c(mu = 1/60,delta= 365/5,gamma = 365/8, alpha = 0.8,iota = 0.8, theta = 0.8, ###starting parameters
  rho = 0.5, amplitude = 0.5, a = 1.1, b = 1.2, S_0 = 0.056, E_0 = 0.002,
  I_0 = 0.002, R_0 = 0.94)

###objective function###
m1 %>%
  traj_objfun(
    est=c("alpha","iota","theta",
      "rho","amplitude","a","b","S_0","E_0","I_0","R_0"),
    params=pars,
    statenames=c("S","E","I","R","H"),
    paramnames=c("mu","delta","gamma","alpha","iota","theta",
      "rho","amplitude","a","b","S_0","E_0","I_0","R_0"),
    partrans=parameter_trans(log=c("alpha"),
      logit=c("rho","amplitude","theta"),
      barycentric=c("S_0","E_0","I_0","R_0"))
  ) -> ofun

library(subplex)
subplex(par=partrans(as(ofun,"pomp"),pars,dir="toEst"),fn=ofun) -> fit
fit

The minimized value of the function is NA.

kingaa commented 2 years ago

I am wondering where dt is defined, is it defined automatically within rprocess?

Yes, the step size is chosen by the pomp function that is evaluating your rprocess snippet.

what's the point of specifying delta.t in euler(step.fun, delta.t)?

As the documentation states, this is the maximum allowed value of the step size. Smaller steps will be taken if an integer number of steps of the specified size does not fit perfectly into the needed interval.
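To illustrate with your own numbers (a back-of-the-envelope sketch of the documented behavior, not the internal pomp code):

  interval <- 1/52                      # weekly data: each observation interval is 1/52 year
  delta.t  <- 1/52/20                   # maximum step size passed to euler()
  nstep    <- ceiling(interval/delta.t) # number of equal sub-steps actually taken: 20
  interval/nstep                        # the dt seen by the C snippet: 1/52/20, since 20 steps fit exactly

  ## with delta.t = 1/365, for example:
  ceiling((1/52)/(1/365))               # 8 sub-steps
  (1/52)/8                              # dt ~ 0.0024, slightly smaller than 1/365 ~ 0.0027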

stevenhuyi commented 2 years ago

Thanks for your explanations, @kingaa; they are very helpful. However, the results of my trajectory fitting, shown below, are very strange.

> fit 
$par
         mu       delta       gamma       alpha        iota       theta         rho 
  1.2644035 101.6945840  47.3708537  -1.3749715   0.4133130   1.5748155  -0.7518998 
  amplitude           a           b         S_0         E_0         I_0         R_0 
 -8.8479107 -11.4714574  10.1847661   1.4489795  -5.2893404  -6.2906773   0.1239168 

$value
[1] NaN

$counts
[1] 3925

$convergence
[1] 1

$message
[1] "limit of machine precision reached"

$hessian
NULL

I did not estimate mu, delta, and gamma, but they changed. The estimates of S_0 and R_0 are not reasonable. There also seems to be a problem with convergence. Thanks for your patience.

kingaa commented 2 years ago

Did you notice any warning messages? I did when I ran your code. You should be aware that, in some parameter regimes, your model may be stiff, which makes integration hard. I don't have the time to teach you about all of these things. You can read about the numerical solution of ODEs (the deSolve manual may be a good place to start). Also, you should be aware that nonlinear optimization is a hard problem, suffering from pathologies such as multiple local minima. It is typically necessary to put more effort into the search for the MLE than simply doing one optimization run.

Finally, it's a good idea to examine your model at the parameters that are causing problems and figure out why there are problems. For example, plug those parameters into trajectory and examine the result.
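Something along these lines, for instance (a sketch reusing the objects above, and assuming the plotting packages you already use are loaded):

  ## back-transform the subplex output to the natural scale, then look at the trajectory
  bad <- partrans(as(ofun,"pomp"), fit$par, dir="fromEst")
  m1 %>%
    trajectory(params=bad, format="data.frame") %>%
    ggplot(aes(x=time, y=H)) + geom_line()

If the trajectory blows up, collapses to zero, or goes negative, that tells you something about why the likelihood is NA there.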

stevenhuyi commented 2 years ago

Hi, @kingaa. Thanks very much for your advice on estimating parameters in nonlinear dynamical systems. I've read a lot of the literature on this topic and on pomp. I'm still working on my data. This time I use the mif2 function instead to carry out the parameter estimation, as follows:

rmeas <- "
  cases = rpois(rho * H);
"
dmeas <- "
  lik = dpois(cases, rho * H, give_log);
"
seas.seir.stoch <- "
  // declare local variables
  double seas, foi, dw,  born, mu_SE;

  seas = r*exp(a*PC1 + b*PC2 + c*sin(2*M_PI*t) + d*sin(4*M_PI*t)); // seasonal variations

  // expected force of infection. iota: imported infections
  foi = seas*pow(I+iota, alpha)/pop;

  // white noise (extrademographic stochasticity)
  dw = rgammawn(sigmaSE,dt);

  mu_SE = foi*dw/dt;  // stochastic force of infection

  // Poisson births: fraction of leak into S from N
  born = rpois(births*dt);

  // rates for states:
  double dN_SE  = rbinom(S , 1-exp(-mu_SE  *dt));
  double dN_EI  = rbinom(E , 1-exp(-mu_EI  *dt));
  double dN_IR  = rbinom(I , 1-exp(-mu_IR  *dt));

  // states update:                
  S += born - dN_SE;
  E += dN_SE  - dN_EI;
  I += dN_EI  - dN_IR;
  H += dN_IR;
  W += (dw - dt)/sigmaSE;  // standardized i.i.d. white noise
"

ind_case %>% 
  pomp(t0=2009,
       time="time",
       rprocess=euler(step.fun = Csnippet(seas.seir.stoch),delta.t=1/365),
       rinit =Csnippet("
                        double m = pop/(S_0+E_0+I_0+R_0);
                        S = nearbyint(m*S_0);
                        E = nearbyint(m*E_0);
                        I = nearbyint(m*I_0);
                        H = 0;
                        W = 0;
                        "),
       dmeasure=Csnippet(dmeas),
       rmeasure=Csnippet(rmeas),
       covar=covariate_table(ind_covar, times = "time"),
       partrans=parameter_trans(log=c("alpha","iota","sigmaSE","r"),
                                logit=c("rho"),
                                barycentric=c("S_0","E_0","I_0","R_0")),
       accumvars="H",
       statenames=c("S","E","I","H","W"),
       paramnames=c("mu_EI","mu_IR","alpha","iota","rho","sigmaSE","r",
                    "a","b","c","d", "S_0","E_0","I_0","R_0")
  ) -> m1
####to guess the parameters #########
pars1 = c(mu_EI= 365/5,mu_IR = 365/8, alpha = 0.5,iota = 1.5, rho = 0.15, 
          sigmaSE = 0.05, a = 0.48, b = 0.3,  c = 0.45,  d = -0.3, r=140000,
          S_0 = 0.012, E_0 = 1.642694e-09,I_0 = 2.156669e-09, R_0 = 0.983)

m1 %>%                  ###simulate at the starting parameters
  simulate(params=pars1,nsim=100,format="data.frame",include.data=TRUE) %>%
  subset(time>0,select=c(time,.id,cases)) %>%
  melt(id=c("time",".id")) %>%
  ggplot(aes(x=time,y=value,color=ifelse(.id=="data","data","simulation"),
             group=.id))+
  labs(color="")+ geom_line()+
  facet_wrap(~variable,ncol=1,scales="free_y")+theme_bw()

###global search for parameter###
profile_design(mu_EI= 365/5,mu_IR = 365/8, type="sobol",
  lower=c(alpha = 0.1,iota = 1,  rho =0.1, sigmaSE = 0.01, r = 100,
          a = -5, b = -5, c=-5, d = -5,
          S_0 = 0.01, E_0 = 5.18335e-09,I_0 = 5.18335e-09,R_0=0.96),
  upper=c(alpha = 3,iota = 100, rho = 0.9,  sigmaSE=0.5, r = 5000,
          a = 5, b = 5,  c=5, d = 5,
          S_0 = 0.02, E_0 = 5.18335e-08,I_0 = 5.18335e-08,R_0=0.97),
  nprof =1000
) -> guesses

#####parallel computation###
library(doParallel)
cl <- detectCores() - 1 
registerDoParallel(cl)

library(doRNG)
registerDoRNG(123) ###to repeat our result

library(foreach)
foreach (start=iter(guesses,"row"),
         .combine=rbind, .packages=c("pomp"), 
         .errorhandling="remove", .inorder=FALSE) %dopar% {
           tic <- Sys.time()
           mf %>% 
             mif2(params=start)  -> mf1

             replicate(10, mf1 %>% pfilter() %>% logLik()) %>%
                logmeanexp(se=TRUE) -> ll

             toc <- Sys.time()
             etime <- toc-tic
             units(etime) <- "hours"

           data.frame(as.list(coef(mf1)),loglik=ll[1],
                      loglik.se=ll[2],t_elaps = as.numeric(etime))
         } -> ests

###more effects to look for the MLE####
ests <- read.csv("ests.csv",header=TRUE,sep=",")
ests %>%
  filter(!is.na(loglik)) %>%
  filter(rank(-loglik)<=100) %>%   ###top 100
  select(-loglik,-loglik.se,-t_elaps) -> starts

registerDoRNG(123) ###to repeat our result
foreach (start=iter(starts,"row"),
         .combine=rbind, .packages=c("pomp"), 
         .errorhandling="remove", .inorder=FALSE) %dopar% {

           mf %>% 
             mif2(params=start) %>%
             mif2() -> mf_fl

           replicate(10, mf_fl %>% pfilter() %>% logLik()) %>%
             logmeanexp(se=TRUE) -> ll

           data.frame(as.list(coef(mf_fl)),loglik=ll[1],loglik.se=ll[2])

         } -> ests_opt

stopImplicitCluster()  ###to release the clusters

###model validation###
ests_opt%>% 
  filter(loglik==max(loglik,na.rm = TRUE)) %>%
  as.vector()-> mle
m2<-m1
coef(m2) <- mle

m2 %>%
  probe(nsim=200,probes=list(
    mean=probe.mean("cases"),
    q=probe.quantile("cases",probs=c(0.05,0.25,0.5,0.75,0.95)),
    probe.acf("cases",lags=c(1,3),type="corr",transform=log)
  )) -> m2_probe

summary(m2_probe)
plot(m2_probe)

Although the overall fit is relatively good, the diagnostic tests indicate that the 5% quantile, acf[1], and acf[3] of the original data are far from those of the simulations (see below). I'm wondering whether the fit can be further improved?

[probe diagnostic plot: seir]

Additionally, I still have two further concerns:

First, it is always better to get proper guesses on models' parameters (by doing simulations) to get a relative well fitting before conducting estimation to get relative . I found it difficult to get proper guesses, especially when there is a large number of parameters to estimate. For some, e.g., alpha and rho in my example, we have some priors; but for others, like a, b, c, d, and r, we have absolutely no clue (note that the value of r is much larger than the others in my example). Is there any experience in dealing with this?

Second, I'm wondering how to calculate confidence intervals for the parameters. I have looked through all the tutorials, but haven't found any clue.

ind_case.csv ind_covar.csv

kingaa commented 2 years ago

@stevenhuyi: you ask a number of interesting questions.

Although the overall fit is relatively good, the diagnostic tests indicate that the 5% quantile, acf[1], and acf[3] of the original data are far from those of the simulations (see below). I'm wondering whether the fit can be further improved?

From the most general perspective, one can always seek to improve the match between model and data. Whether it is worthwhile to do so depends on your purposes, which we have not discussed. From a narrower perspective, what does the discrepancy between model prediction and data, as revealed in these three summary statistics, tell us?

The low 5th-percentile statistic says that the data go to levels considerably lower than the model predicts. Why does the model not reach such low levels? The discrepancies in the ACFs suggest different frequencies of fluctuations in the model and the data. By thinking about the relation between these aspects and the model, you might come to an understanding of how constraints on the model are preventing it from doing a better job explaining the data. Whether it is worthwhile relaxing these constraints is another question: again, the answer depends on your purposes.

First, it is always better to get proper guesses on models' parameters (by doing simulations) to get a relative well fitting before conducting estimation to get relative . I found it difficult to get proper guesses, especially when there is a large number of parameters to estimate. For some, e.g., alpha and rho in my example, we have some priors; but for others, like a, b, c, d, and r, we have absolutely no clue (note that the value of r is much larger than the others in my example). Is there any experience in dealing with this?

Is your first sentence correct as written? I don't understand it. Is it meant to be a question?

It's important to note that it's rarely the case that you have absolutely no clue about a parameter's value. What are the relevant scales in the problem? Are there edge cases? Since you've included the parameter, it must have a meaning to you. Given that meaning, what is a plausible range?

More generally, you put your finger on the fundamental problem of parameter estimation: parameter spaces in interesting models are essentially always of inconveniently high dimension. There is nothing one can do about this but search carefully and document your search. Two reflections on this may be useful to you.

  1. Remember the 'no-free-lunch principle'. At best, pomp helps you translate hard scientific problems into hard geometry problems.
  2. Science is a social enterprise. When faced with a difficult problem, we do the best we can and are careful to document what we did. Then others can come after and make improvements efficiently. In the present context, note that no matter how hard you search, there is always the possibility that a better solution has eluded you. However, if you document your search well, those who follow you will not have to repeat your efforts.

For computing confidence intervals, we typically recommend profile likelihood. For a discussion of this and examples, see the SBIED short-course materials. E.g., here and here.
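Very schematically, profiling (say) rho looks like the following (guess_lower and guess_upper are placeholders for named vectors of ranges; the mif2/pfilter steps are as in your own code above):

  ## fix rho at each point of a grid; vary starting guesses for the other parameters
  profile_design(
    rho=seq(0.1,0.9,length=20),            # the profiled parameter
    lower=guess_lower, upper=guess_upper,  # placeholder ranges for the remaining parameters
    nprof=20
  ) -> rho_prof

  ## for each row of rho_prof: run mif2 with rw.sd(rho=0) so rho stays fixed,
  ## estimate the log likelihood with pfilter + logmeanexp as before, then plot
  ## the maximized loglik against rho; the approximate 95% CI is the set of rho with
  ## loglik > max(loglik) - qchisq(0.95, df=1)/2.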

kingaa commented 2 years ago

I am closing this issue for now. Feel free to reopen should more discussion be warranted.

stevenhuyi commented 2 years ago

@kingaa thanks for your patience and explanations.

Whether it is worthwhile to do so depends on your purposes, which we have not discussed.

The primary goal of my study is to investigate whether the intensity of hand, foot, and mouth disease (HFMD) incidence (shown in the ind_case.csv file) is driven by the two components of the climatic variables (which are the result of a principal component analysis). Considering your suggestion about "how constraints on the model are preventing it from doing a better job explaining the data", I tried two forms of seasonal variation in the infection rate:

seas1 = k + r*exp(a*PC1 + b*PC2); // seasonal variations
seas2 = exp(k + a*PC1 + b*PC2); // seasonal variations

The probe tests for the two forms of the infection rate are as follows:

[probe plots for the two seasonality forms: qd_abrk, qd_abk]

But some diagnostic statistics are still not good. Honestly, I am not quite sure about "the constraints that prevent my model from explaining the data well". Would you please give me some clues?

Another concern is that the global search for the optimal parameter set is really time-consuming (almost 12 hours in my code) for a single city. To investigate my hypothesis of climate-driven HFMD incidence, I have to conduct the test in other cities; I have HFMD incidence data for 92 cities in East China. Is there any way to alleviate such a huge calculation? Thanks very much again.

kingaa commented 2 years ago

@stevenhuyi: Unfortunately, I have limited time available to help you reason about your model and your data. You are much more intimately familiar with them than I am. My advice to you is twofold:

First, keep in mind that no model will ever be perfect. If you set yourself the goal of finding a model for which diagnostic statistics $X_1, \dots, X_n$ are consistent with the data and doing so within specific time and expense budgets, there are two possibilities: you will succeed or you will fail. Should you succeed, you will have to face the fact that you might have failed had you included features $X_{n+1}, \dots, X_m$ in your search. Should you fail, you would not be certain that you might not have succeeded had you increased your budgets by some amount. This is the context in which my remarks above about science being a social enterprise are relevant: do what you can; explain, document, and archive what you have done in a clear-eyed, honest, and self-critical way; describe the prospects for future progress to the best of your ability.

Second, your goal in interpreting a diagnostic statistic—which quantifies one feature of the data—is to understand what aspects of the model structure are preventing the model from agreeing with that feature of the data. This may or may not be obvious to you: if it is not, it will drive you to understand your model better. Once you do understand this, you have a choice as to whether it is necessary or desirable to modify the model to alleviate this constraint. Here again, keep in mind that no model will ever be perfect, nor is perfection necessary for scientific progress.

How do you go about learning which features are within the model's repertoire, which are not, and what the trade-offs between features are? The theories of nonlinear dynamical systems and of stochastic processes have much to say in this regard, and it is worthwhile to study them. Too, simulation is an indispensable tool. You can simulate to explore empirically what changes of parameters lead to changes in one probe, and how these changes affect other probes and the likelihood itself. You can plot trajectories of state variables to understand the story the model is telling about how the data arose. You can also plot individual terms of your model equations to interrogate the model's story more closely.

Finally, you mention your concerns about computational expense. Simulation-based methods are powerful, but they are not cheap. If it makes you feel any better, we do not consider it unusual to spend several days fitting a complex model to data, using hundreds of processors on a high-performance computing cluster. As you think about scaling your computations up to the 92 cities, you will certainly want to parallelize your calculations. See the SBIED course materials (starting with Lesson 3) for many examples. Depending on your questions, you may want to think about using panelPomp or spatPomp.

stevenhuyi commented 2 years ago

@kingaa Thanks for your instant and detailed reply every time.

First, keep in mind that no model will ever be perfect. ...Should you succeed, you will have to face the fact that you might have failed had you included features in your search. Should you fail, you would not be certain that you might not have succeeded had you increased your budgets by some amount. This is the context in which my remarks above about science being a social enterprise are relevant: ....

Got it. I used to focus too much on those probe statistics, always hoping they would all be good. This time, I've compared the model simulations with the data, as follows:

[plot comparing the data with the envelope of model simulations]

It seems that if the simulations were shifted left a little, the model would explain the variation in the data well. Perhaps one solution is to investigate lagged effects of the climatic factors on incidence.

See the SBIED course materials (starting with Lesson 3) for many examples. Depending on your questions, you may want to think about using panelPomp or spatPomp

Thanks, @kingaa. I will surely have a try.

kingaa commented 2 years ago

Very good observation. As you say, perhaps your seasonality predictors are not as good as you had hoped.

One suggestion: supplement the envelope plot you made above with a plot showing the data and 10 or so random simulations. This can reveal things missed by the envelope plot.
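For example (a sketch reusing your m2 object with the MLE coefficients, in the same style as your plotting code above):

  m2 %>%
    simulate(nsim=10, format="data.frame", include.data=TRUE) %>%
    ggplot(aes(x=time, y=cases, group=.id,
               color=ifelse(.id=="data","data","simulation")))+
    geom_line()+labs(color="")+theme_bw()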

kingaa commented 2 years ago

Closing this issue now. Feel free to re-open if more discussion is warranted.