@stevenhuyi: thanks for the detailed report.
The error message arises because the dt variable you reference is not defined in the context in which the C snippet is executed, except as a function (the probability density function of the Student t distribution, which is certainly not what you had in mind). This explains the error: C does not know how to multiply (binary *) a number ('double') and a function ('double (*)(double, double, int)'). The solution is to define dt inside your C snippet. I can understand that you assumed it would be defined there, as it is in the rprocess snippet, but it is not. Call it a bug, call it a feature, but for a skeleton map snippet, if you want a timestep, you have to define it yourself inside the snippet as well as outside. Looking over the documentation, I see that, while dt is not mentioned as being available to the C snippet, neither is there anything saying that it isn't. I think I'll take a moment and add a note to that effect.
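A minimal sketch of that pattern, in case it helps (the snippet name and the 1/365 step are placeholders, not taken from your code):
seas.seir.map <- "
  double dt = 1.0/365.0;   // must agree with the delta.t passed to map() below
  // ... the rest of the map snippet, e.g. dN[0] = births*dt; ...
"
# and then, in the pomp() call:
# skeleton = map(Csnippet(seas.seir.map), delta.t = 1/365)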
Looking at your skeleton C snippet, I notice that it is probably not going to behave the way you intend. You seem to want to model it as a map, but many of the equations read like the specification of a vectorfield. In particular, in the following lines it does seem like the RHS should be proportional to the timestep.
dN[1] = rate[1]*S;
dN[2] = rate[2]*S;
dN[3] = rate[3]*E;
dN[4] = rate[4]*E;
dN[5] = rate[5]*I;
dN[6] = rate[6]*I;
dN[7] = rate[7]*R;
DS = dN[0] - dN[1] - dN[2];
DE = dN[1] - dN[3] - dN[4];
DI = dN[3] - dN[5] - dN[6];
DR = dN[5] - dN[7];
DH = dN[3];
Perhaps you can explain your rationale and I can offer advice?
Thanks for your instant reply, @kingaa. Let me reply to your second question first. As my equations show, I should specify a vectorfield in the skeleton. At the beginning I did not know exactly when to specify a vectorfield and when to specify a map, although I was sure my equations are differential equations. Now I think they should be specified as a vectorfield. Would you please also explain a bit more about when to use a map, given that the documentation says it is used for a discrete-time process?
If a vectorfield is chosen, how do I define dt? You said that dt should be defined as a function in the C snippet. All I intended to do was to add the newborns to the susceptibles at each (systematic) step; hence the births * dt term. However, if a vectorfield is used, there is no way to define the time step. One point that puzzles me is why I do not have to define dt in the C snippet when running a stochastic state process, as I did in the above m2. Is it because it is defined in euler(step.fun = Csnippet(seas.seir.stoch), delta.t = 1/52/20)?
This is one thing that is not complicated! The generator of a continuous-time dynamical system is a vectorfield; a map generates a discrete-time dynamical system.
When you specify the skeleton as a vectorfield and ask for a trajectory, pomp calls an ODE integrator from the deSolve package. This routine decides internally how to translate your specification of the rate of various events into the trajectory they imply. I suggest you read up on the documentation of the deSolve package.
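Schematically, the two declarations differ only in how the D-variables are interpreted (the names here are illustrative, not from your code):
# vectorfield: DS, DE, ... are read as time-derivatives dS/dt, dE/dt, ...
#   skeleton = vectorfield(Csnippet(seas.seir.determin))
# map: DS, DE, ... are read as the state one time step (delta.t) later
#   skeleton = map(Csnippet(seas.seir.map), delta.t = 1/52)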
You said that dt should be defined as a function in the C snippet.
You misunderstood me. I did not say that dt should be defined as a function, only that it is already defined as a totally unrelated function in the R C API, which is visible to every C snippet. If you specify the skeleton as a map and want to refer to the step size, you must define it yourself (thereby overriding the reference to the p.d.f. of the Student t, which is not what you want). If your skeleton is a vectorfield, there is no stepsize and no reason for one.
In an rprocess snippet, dt is defined. You cannot control it (it is set by pomp, and can change), so its current value is passed to the C snippet when the snippet is executed.
In an rprocess snippet, dt is defined.
I am wondering where dt is defined: is it defined automatically within rprocess?
You cannot control it (it is set by pomp, and can change),
If so, what's the point of specifying delta.t in euler(step.fun, delta.t)?
Finally, I changed my code as follows to estimate the parameters; however, it ran into some problems. I wonder whether my subjective starting values for the parameters are causing the trouble. Would you help me out? Thanks very much.
rmeas <- "
cases = rnbinom_mu(1/theta, rho * H);
"
dmeas <- "
lik = dnbinom_mu(cases, 1/theta, rho * H, give_log);
"
seas.seir.determin <- "
double schl;
double seas;
double Beta;
double rate[8];
double dN[8];
// term-time seasonality
t = (t-floor(t))*365.25;
if ((t>=4&&t<=31) || (t>=60&&t<=180) || (t>=243&&t<=272) || (t>=280&&t<=356.25))
schl = 1.0+amplitude*0.2987/0.7013; // the fraction of the year children are in school is 0.7013
else
schl = 1.0-amplitude;
seas = exp(a*PC1 + b*PC2);
Beta = seas*schl;
rate[0] = births; // birth
rate[1] = Beta*pow(I + iota,alpha) / pop; // infection
rate[2] = mu; // death from S
rate[3] = delta; // advance from E to I
rate[4] = mu; // death from E
rate[5] = gamma; // recovery from I to R
rate[6] = mu; // death from I
rate[7] = mu; // death from R
dN[0] = rate[0];
dN[1] = rate[1]*S;
dN[2] = rate[2]*S;
dN[3] = rate[3]*E;
dN[4] = rate[4]*E;
dN[5] = rate[5]*I;
dN[6] = rate[6]*I;
dN[7] = rate[7]*R;
DS = dN[0] - dN[1] - dN[2];
DE = dN[1] - dN[3] - dN[4];
DI = dN[3] - dN[5] - dN[6];
DR = dN[5] - dN[7];
DH = dN[3];
"
ind_case %>%
pomp(t0=2009,
time="time",
rinit =Csnippet("
double m = pop/(S_0+E_0+I_0+R_0);
S = nearbyint(m*S_0);
E = nearbyint(m*E_0);
I = nearbyint(m*I_0);
R = nearbyint(m*R_0);
H = 0;
"),
skeleton = vectorfield(Csnippet(seas.seir.determin)),
dmeasure=Csnippet(dmeas),
rmeasure=Csnippet(rmeas),
covar=covariate_table(ind_covar, times = "time"),
accumvars="H",
statenames=c("S","E","I","R","H"),
paramnames=c("mu","delta","gamma","alpha","iota","theta",
"rho","amplitude","a","b","S_0","E_0","I_0","R_0")
) -> m1
pars = c(mu = 1/60,delta= 365/5,gamma = 365/8, alpha = 0.8,iota = 0.8, theta = 0.8, ###starting parameters
rho = 0.5, amplitude = 0.5, a = 1.1, b = 1.2, S_0 = 0.056, E_0 = 0.002,
I_0 = 0.002, R_0 = 0.94)
###objective function###
m1 %>%
traj_objfun(
est=c("alpha","iota","theta",
"rho","amplitude","a","b","S_0","E_0","I_0","R_0"),
params=pars,
statenames=c("S","E","I","R","H"),
paramnames=c("mu","delta","gamma","alpha","iota","theta",
"rho","amplitude","a","b","S_0","E_0","I_0","R_0"),
partrans=parameter_trans(log=c("alpha"),
logit=c("rho","amplitude","theta"),
barycentric=c("S_0","E_0","I_0","R_0"))
) -> ofun
library(subplex)
subplex(par=partrans(as(ofun,"pomp"),pars,dir="toEst"),fn=ofun) -> fit
fit
The minimized value of the function is NA.
I am wondering where dt is defined: is it defined automatically within rprocess?
Yes, the step size is chosen by the pomp function that is evaluating your rprocess snippet.
If so, what's the point of specifying delta.t in euler(step.fun, delta.t)?
As the documentation states, this is the maximum allowed value of the stepsize. Smaller steps will be taken if an integer number of steps of the specified size do not fit perfectly in the needed interval.
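To illustrate the rule with numbers (my own sketch of the idea, not code from pomp itself): with delta.t = 0.3 and an interval of length 1, an integer number of 0.3-steps does not fit, so four equal steps of length 0.25 are taken instead.
interval <- 1
delta.t  <- 0.3
nstep <- ceiling(interval/delta.t)   # 4 steps
dt    <- interval/nstep              # 0.25, the step size actually used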
Thanks for your explanations, @kingaa; they are very helpful. However, the results of my trajectory fitting, shown below, are very strange.
> fit
$par
mu delta gamma alpha iota theta rho
1.2644035 101.6945840 47.3708537 -1.3749715 0.4133130 1.5748155 -0.7518998
amplitude a b S_0 E_0 I_0 R_0
-8.8479107 -11.4714574 10.1847661 1.4489795 -5.2893404 -6.2906773 0.1239168
$value
[1] NaN
$counts
[1] 3925
$convergence
[1] 1
$message
[1] "limit of machine precision reached"
$hessian
NULL
I did not estimate mu, delta, and gamma, yet they changed. The estimates of S_0 and R_0 are not reasonable. There also seems to be a problem with convergence. Thanks for your patience.
Did you notice any warning messages? I did when I ran your code. You should be aware that, in some parameter regimes, your model may be stiff, which makes integration hard. I don't have the time to teach you about all of these things. You can read about numerical solution of ODE (the deSolve manual may be a good place to start). Also, you should be aware that nonlinear optimization is a hard problem, suffering from pathologies such as multiple local minima. It is typically necessary to put more effort into the search for the MLE than simply doing one optimization run.
Finally, it's a good idea to examine your model at the parameters that are causing problems and figure out why there are problems. For example, plug those parameters into trajectory and examine the result.
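For instance, something along these lines (a sketch only, reusing the m1 object and the pars vector defined above):
library(ggplot2)
m1 %>%
  trajectory(params = pars, format = "data.frame") %>%
  ggplot(aes(x = time, y = H)) +
  geom_line() +
  theme_bw()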
Hi, @kingaa. Thanks very much for your advice on estimating parameters in non-linear dynamic systems. I've read a lot of the literature on this topic and on pomp, and I'm still working on my data. This time I use the mif2 function instead to carry out the parameter estimation, as follows:
rmeas <- "
cases = rpois(rho * H);
"
dmeas <- "
lik = dpois(cases, rho * H, give_log);
"
seas.seir.stoch <- "
// declare local variables
double seas, foi, dw, born, mu_SE;
seas = r*exp(a*PC1 + b*PC2 + c*sin(2*M_PI*t) + d*sin(4*M_PI*t)); // seasonal variations
// expected force of infection. iota: imported infections
foi = seas*pow(I+iota, alpha)/pop;
// white noise (extrademographic stochasticity)
dw = rgammawn(sigmaSE,dt);
mu_SE = foi*dw/dt; // stochastic force of infection
// Poisson births: fraction of leak into S from N
born = rpois(births*dt);
// rates for states:
double dN_SE = rbinom(S , 1-exp(-mu_SE *dt));
double dN_EI = rbinom(E , 1-exp(-mu_EI *dt));
double dN_IR = rbinom(I , 1-exp(-mu_IR *dt));
// states update:
S += born - dN_SE;
E += dN_SE - dN_EI;
I += dN_EI - dN_IR;
H += dN_IR;
W += (dw - dt)/sigmaSE; // standardized i.i.d. white noise
"
ind_case %>%
pomp(t0=2009,
time="time",
rprocess=euler(step.fun = Csnippet(seas.seir.stoch),delta.t=1/365),
rinit =Csnippet("
double m = pop/(S_0+E_0+I_0+R_0);
S = nearbyint(m*S_0);
E = nearbyint(m*E_0);
I = nearbyint(m*I_0);
H = 0;
W = 0;
"),
dmeasure=Csnippet(dmeas),
rmeasure=Csnippet(rmeas),
covar=covariate_table(ind_covar, times = "time"),
partrans=parameter_trans(log=c("alpha","iota","sigmaSE","r"),
logit=c("rho"),
barycentric=c("S_0","E_0","I_0","R_0")),
accumvars="H",
statenames=c("S","E","I","H","W"),
paramnames=c("mu_EI","mu_IR","alpha","iota","rho","sigmaSE","r",
"a","b","c","d", "S_0","E_0","I_0","R_0")
) -> m1
####to guess the parameters #########
pars1 = c(mu_EI= 365/5,mu_IR = 365/8, alpha = 0.5,iota = 1.5, rho = 0.15,
sigmaSE = 0.05, a = 0.48, b = 0.3, c = 0.45, d = -0.3, r=140000,
S_0 = 0.012, E_0 = 1.642694e-09,I_0 = 2.156669e-09, R_0 = 0.983)
m1 %>% ### simulate at the starting parameters
simulate(params=pars1,nsim=100,format="data.frame",include.data=TRUE) %>%
subset(time>0,select=c(time,.id,cases)) %>%
melt(id=c("time",".id")) %>%
ggplot(aes(x=time,y=value,color=ifelse(.id=="data","data","simulation"),
group=.id))+
labs(color="")+ geom_line()+
facet_wrap(~variable,ncol=1,scales="free_y")+theme_bw()
###global search for parameter###
profile_design(mu_EI= 365/5,mu_IR = 365/8, type="sobol",
lower=c(alpha = 0.1,iota = 1, rho =0.1, sigmaSE = 0.01, r = 100,
a = -5, b = -5, c=-5, d = -5,
S_0 = 0.01, E_0 = 5.18335e-09,I_0 = 5.18335e-09,R_0=0.96),
upper=c(alpha = 3,iota = 100, rho = 0.9, sigmaSE=0.5, r = 5000,
a = 5, b = 5, c=5, d = 5,
S_0 = 0.02, E_0 = 5.18335e-08,I_0 = 5.18335e-08,R_0=0.97),
nprof =1000
) -> guesses
#####parallel computation###
library(doParallel)
cl <- detectCores() - 1
registerDoParallel(cl)
library(doRNG)
registerDoRNG(123) ###to repeat our result
library(foreach)
foreach (start=iter(guesses,"row"),
.combine=rbind, .packages=c("pomp"),
.errorhandling="remove", .inorder=FALSE) %dopar% {
tic <- Sys.time()
mf %>%
mif2(params=start) -> mf1
replicate(10, mf1 %>% pfilter() %>% logLik()) %>%
logmeanexp(se=TRUE) -> ll
toc <- Sys.time()
etime <- toc-tic
units(etime) <- "hours"
data.frame(as.list(coef(mf1)),loglik=ll[1],
loglik.se=ll[2],t_elaps = as.numeric(etime))
} -> ests
### more effort to look for the MLE ###
ests <- read.csv("ests.csv",header=TRUE,sep=",")
ests %>%
filter(!is.na(loglik)) %>%
filter(rank(-loglik)<=100) %>% ###top 100
select(-loglik,-loglik.se,-t_elaps) -> starts
registerDoRNG(123) ###to repeat our result
foreach (start=iter(starts,"row"),
.combine=rbind, .packages=c("pomp"),
.errorhandling="remove", .inorder=FALSE) %dopar% {
mf %>%
mif2(params=start) %>%
mif2() -> mf_fl
replicate(10, mf_fl %>% pfilter() %>% logLik()) %>%
logmeanexp(se=TRUE) -> ll
data.frame(as.list(coef(mf_fl)),loglik=ll[1],loglik.se=ll[2])
} -> ests_opt
stopImplicitCluster() ###to release the clusters
###model validation###
ests_opt%>%
filter(loglik==max(loglik,na.rm = TRUE)) %>%
as.vector()-> mle
m2<-m1
coef(m2) <- mle
m2 %>%
probe(nsim=200,probes=list(
mean=probe.mean("cases"),
q=probe.quantile("cases",probs=c(0.05,0.25,0.5,0.75,0.95)),
probe.acf("cases",lags=c(1,3),type="corr",transform=log)
)) -> m2_probe
summary(m2_probe)
plot(m2_probe)
Although the overall fit is relatively good, the diagnostic tests indicate that the 5% quantile, acf[1], and acf[3] of the original data are far from those of the simulations (as follows). I'm wondering whether the fit can be further improved.
Additionally, I still have two further concerns:
First, it is always better to get proper guesses on models' parameters (by doing simulations) to get a relative well fitting before conducting estimation to get relative . I found it difficult to get proper ones especially there are large number of parameters to estimate. For some, e.g., alpha and rho in my example, we have some priors, but others, like a, b, c, d, and r, we absolutely have no clues (note that the value of r is much larger than others in my example). Is there any experience to deal with this?
Second, I'm wondering how to calculate confidence intervals for the parameters. I have looked through all the tutorials, but haven't found any clue.
@stevenhuyi: you ask a number of interesting questions.
Although the overall fit is relatively good, the diagnostic tests indicate that the 5% quantile, acf[1], and acf[3] of the original data are far from those of the simulations (as follows). I'm wondering whether the fit can be further improved.
From the most general perspective, one can always seek to improve the match between model and data. Whether it is worthwhile to do so depends on your purposes, which we have not discussed. From a narrower perspective, what does the discrepancy between model prediction and data, as revealed in these three summary statistics, tell us?
The low 5th-percentile statistic says that the data go to levels considerably lower than the model predicts. Why does the model not reach such low levels? The discrepancies in the ACFs suggest different frequencies of fluctuations in the model and the data. By thinking about the relation between these aspects and the model, you might come to an understanding of how constraints on the model are preventing it from doing a better job explaining the data. Whether it is worthwhile relaxing these constraints is another question: again, the answer depends on your purposes.
First, it is always better to get proper guesses on models' parameters (by doing simulations) to get a relative well fitting before conducting estimation to get relative . I found it difficult to get proper ones especially there are large number of parameters to estimate. For some, e.g., alpha and rho in my example, we have some priors, but others, like a, b, c, d, and r, we absolutely have no clues (note that the value of r is much larger than others in my example). Is there any experience to deal with this?
Is your first sentence correct as written? I don't understand it. Is it meant to be a question?
It's important to note that it's rarely the case that you have absolutely no clue about a parameter's value. What are the relevant scales in the problem? Are there edge cases? Since you've included the parameter, it must have a meaning to you. Given that meaning, what is a plausible range?
More generally, you put your finger on the fundamental problem of parameter estimation: parameter spaces in interesting models are essentially always of inconveniently high dimension. There is nothing one can do about this but search carefully and document your search. Two reflections on this may be useful to you.
For computing confidence intervals, we typically recommend profile likelihood. For a discussion of this and examples, see the SBIED short-course materials. E.g., here and here.
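As a rough sketch of the cutoff rule only (my own illustration, using your ests data frame as a stand-in for a true profile over rho): parameter values whose log likelihood lies within qchisq(0.95, df = 1)/2, about 1.92 units, of the maximum belong to the approximate 95% confidence set.
library(dplyr)
ests %>%
  filter(!is.na(loglik)) %>%
  filter(loglik > max(loglik) - qchisq(0.95, df = 1)/2) %>%
  summarize(lower = min(rho), upper = max(rho))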
I am closing this issue for now. Feel free to reopen should more discussion be warranted.
@kingaa thanks for your patience and explanations.
Whether it is worthwhile to do so depends on your purposes, which we have not discussed.
The primary goal of my experiment is to investigate whether the intensity of hand, foot and mouth disease (HFMD) incidence (shown in the ind_case.csv file) is driven by the two components of the climatic variables (which are the result of a principal component analysis). Considering your suggestion about "how constraints on the model are preventing it from doing a better job explaining the data", I tried two forms of seasonal variation of the infection rate:
seas1 = k + r*exp(a*PC1 + b*PC2); // seasonal variations
seas2 = exp(k + a*PC1 + b*PC2); // seasonal variations
The probe tests for the two infection rates are as follows:
But some diagnostic statistics are still not good. Honestly, I am not quite sure about "the constraints that prevent my model from explaining the data well". Would you please give me some clues?
Another concern is that the global search for the optimal parameter set is really time-consuming (almost 12 hours with my code) for a single city. In order to investigate my hypothesis of climatic-factor-driven incidence of HFMD, I have to conduct the test in other cities; I have HFMD incidence data for 92 cities in East China. Is there any way to alleviate such a huge calculation? Thanks very much again.
@stevenhuyi: Unfortunately, I have limited time available to help you reason about your model and your data. You are much more intimately familiar with them than I am. My advice to you is twofold:
First, keep in mind that no model will ever be perfect. If you set yourself the goal of finding a model for which diagnostic statistics $X_1, \dots, X_n$ are consistent with the data and doing so within specific time and expense budgets, there are two possibilities: you will succeed or you will fail. Should you succeed, you will have to face the fact that you might have failed had you included features $X_{n+1}, \dots, X_m$ in your search. Should you fail, you would not be certain that you might not have succeeded had you increased your budgets by some amount. This is the context in which my remarks above about science being a social enterprise are relevant: do what you can; explain, document, and archive what you have done in a clear-eyed, honest, and self-critical way; describe the prospects for future progress to the best of your ability.
Second, your goal in interpreting a diagnostic statistic—which quantifies one feature of the data—is to understand what aspects of the model structure are preventing the model from agreeing with that feature of the data. This may or may not be obvious to you: if it is not, it will drive you to understand your model better. Once you do understand this, you have a choice as to whether it is necessary or desirable to modify the model to alleviate this constraint. Here again, keep in mind that no model will ever be perfect, nor is perfection necessary for scientific progress.
How do you go about learning which features are within the model's repertoire, which are not, and what the trade-offs between features are? The theories of nonlinear dynamical systems and of stochastic processes have much to say in this regard, and it is worthwhile to study them. Simulation, too, is an indispensable tool. You can simulate to explore empirically which changes of parameters lead to changes in one probe, and how these changes affect other probes and the likelihood itself. You can plot trajectories of state variables to understand the story the model is telling about how the data arose. You can also plot individual terms of your model equations to interrogate the model's story more closely.
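One sketch of such an exploration (reusing your m2 object and the probes from your code above; the 50% perturbation of a is arbitrary):
theta_alt <- coef(m2)
theta_alt["a"] <- 1.5*theta_alt["a"]
m2 %>%
  probe(params = theta_alt, nsim = 200,
        probes = list(probe.acf("cases", lags = c(1,3), type = "corr", transform = log))) %>%
  summary()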
Finally, you mention your concerns about computational expense. Simulation-based methods are powerful, but they are not cheap. If it makes you feel any better, we do not consider it unusual to spend several days fitting a complex model to data, using hundreds of processors on a high-performance computing cluster. As you think about scaling your computations up to the 92 cities, you will certainly want to parallelize your calculations. See the SBIED course materials (starting with Lesson 3) for many examples. Depending on your questions, you may want to think about using panelPomp or spatPomp.
@kingaa Thanks for your instant and detailed reply every time.
First, keep in mind that no model will ever be perfect. ...Should you succeed, you will have to face the fact that you might have failed had you included features in your search. Should you fail, you would not be certain that you might not have succeeded had you increased your budgets by some amount. This is the context in which my remarks above about science being a social enterprise are relevant: ....
Got it. I used to focus too much on those probe statistics and always hoped they would all be good. This time, I've compared the model simulations with the data as follows:
It seems that if the simulations could be shifted left a little bit, the model would explain the variation in the data well. Perhaps one resolution is to investigate lagged effects of the climatic factors in driving the incidence.
See the SBIED course materials (starting with Lesson 3) for many examples. Depending on your questions, you may want to think about using panelPomp or spatPomp
Thanks, @kingaa. I will surely have a try.
Very good observation. As you say, perhaps your seasonality predictors are not as good as you had hoped.
One suggestion: supplement the envelope plot you made above with a plot showing the data and 10 or so random simulations. This can reveal things missed by the envelope plot.
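For instance (a sketch only, assuming the fitted m2 object from above):
library(ggplot2)
m2 %>%
  simulate(nsim = 10, format = "data.frame", include.data = TRUE) %>%
  ggplot(aes(x = time, y = cases, group = .id,
             color = ifelse(.id == "data", "data", "simulation"))) +
  geom_line() +
  labs(color = "") +
  theme_bw()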
Closing this issue now. Feel free to re-open if more discussion is warranted.
I am trying to use an SEIR model in pomp to fit my observed series of HFMD data. I encounter an error when constructing the skeleton argument; my code is as follows:
Here comes the error:
Since births in the covariate data frame is the annual accumulated number of births, births * dt should be the number of individuals entering the susceptible class at each dt step. What type is the dt variable? It seems that dt cannot be used in a binary calculation. I also changed it to a stochastic process using the following code, and that works. Would you please tell me why? Thanks in advance.
ind_case.csv ind_covar.csv