Trajectory Fitting issue

kingaa / pomp

R package for statistical inference using partially observed Markov processes

https://kingaa.github.io/pomp

GNU General Public License v3.0

110 stars 26 forks source link

Trajectory Fitting issue #149

Closed catherinebyrne closed 2 years ago

catherinebyrne commented 3 years ago

Hello,

I am trying to work through a pomp model vignette that involves fitting ecological data to a system of ODEs.

https://github.com/kingaa/pomp/issues/51

I'm really struggling with the trajectory matching part as the function traj.match that is used in this model I believe is depreciated. In the example I'm referring to, the code is:

po %>% traj.match(start=c(K1=180,K2=100,N1_0=10,N2_0=10,r1=0.3,r2=0.4,
                          a12=0.3,a21=0.4,theta=0.9,p=0.8),
                  est=c("r1","r2","a12","a21","theta","p"),
                  method="Nelder-Mead",transform=TRUE,maxit=1000) -> fit

How should this be written now? I think the function is now traj_objfun(), but I can't get it to work. Any help would be greatly appreciated!

Thanks

kingaa commented 3 years ago

The workflow as of pomp version 2 is to construct an objective function that is then optimized using any general-purpose optimization tool. The optimization function created is stateful in that it remembers the point at which it was last evaluated.

For example, the following fits four parameters from an SIR model to simulated data (the model and data are those from the example included in the package).

library(pomp)

enames <- c("gamma","iota","rho","k")

sir() |>
  traj_objfun(
    dmeasure=Csnippet("
      lik = dnbinom_mu(nearbyint(reports),1/k,rho*cases,give_log);
    "),
    rmeasure=NULL,
    params=c(coef(sir()),k=1),
    partrans=parameter_trans(
      log=c("gamma","iota","k"),
      logit="rho"
    ),
    est=enames,
    paramnames=enames,
    statenames="cases"
  ) -> ofun

theta <- c(gamma=10,iota=1,rho=0.2,k=1)

library(subplex)
subplex(
  fn=ofun,
  par=partrans(as(ofun,"pomp"),theta,dir="toEst"),
  control=list(trace=0)
) -> fit

fit
ofun(fit$par)

coef(ofun)

library(ggplot2)

ofun |>
  trajectory(format="d") |>
  ggplot(aes(x=time,y=coef(ofun,"rho")*cases))+
  geom_line()+
  geom_point(
    data=as(ofun,"data.frame"),
    mapping=aes(x=time,y=reports)
  )+
  theme_bw()

The chunk beginning with sir() constructs the pomp object and pipes it to traj_objfun, which constructs the objective function. Note that I have here modified the measurement model from the example that comes with the package. Parameter transformations are provided for each of the four parameters which are to be estimated. theta is a (not very good) guess as to the unknown parameters, which will be used as a starting place for the search. The next chunk (beginning with library(subplex)), uses subplex to maximize the likelihood via trajectory-matching. The result is stored in fit. I then examine fit, call ofun one last time at the estimated parameters (just in case), and then plot the data and the expected value of the data under the fitted model on the same axes.

kingaa commented 3 years ago

I've tried to streamline your code down to focus on the essentials. Since you are doing trajectory matching, your rproc is neither here nor there. I don't have dat3, so I can't verify if this works properly. If the thread continues, please upload that (or something that acts like it; it isn't necessary to include the actual data) with your reply.

library(pomp)

model <- vectorfield(
  Csnippet("
            DVs = (beta-c*T-alpha)*Vs;
            DVb = alpha*Vs+(beta-c*T)*Vb;
            DT = (sigma*(Vs+Vb)-m)*T+p;"
  )
)

init <- Csnippet("
                 Vs_r = 200;
                 Vb_r = 0;
                 background = 3000;
                 Vs = 3020;
                 Vb = 3000;
                 T = 20;
                 ")

rmeas = Csnippet("
            N_Vs = rpois(Vs*rho);
            N_Vb = rpois(Vb*rho);")

dmeas <- Csnippet("
            lik = dpois(N_Vs,Vs*rho,1)+
                   dpois(N_Vb,Vb*rho,1);
            lik = (give_log) ? lik : exp(lik)")

Note that the dmeas as you had written it was incorrect. In the case that give_log=1 it would return the correct answer (the sum of the log likelihoods), but if give_log=0, it would return nonsense (the sum of the likelihoods).

dat3 |>
  traj_objfun(
    times="time", t0=0,
    skeleton=model,
    rmeasure=rmeas,
    dmeasure=dmeas, 
    rinit = init,
    partrans=parameter__trans(log=c("beta","c","alpha","sigma","m","p")),
    paramnames=c("beta","c","alpha","sigma","m","p","rho","shape_fit","rate_fit","factor","area_sg","area_b"),
    statenames=c("Vs_r","Vb_r","Vs","Vb","T","background"),
    est=enames,
    params = c(
      beta=0.7,c = 0.01,alpha = 0.01,
      sigma = 0.0002, m = 0.05, p = 1,rho = 0.9,
      factor = factor,area_sg = area_sg, area_b = area_b,
      shape_fit = shape_fit, rate_fit = rate_fit
    )
  ) -> ofun

I am not sure where the warnings are coming from. You are postulating a Poisson measurement model: are the data all integers, as this model assumes? If not, this may be your problem. If they are, then perhaps you can include the data (dat3 or equivalent), and we can do some more detailed debugging.

kingaa commented 3 years ago

One thing does occur to me: you define Vs_r, Vb_r, and background as latent states, but you do not provide any derivatives. I suppose you are assuming that, left unspecified, these derivatives are effectively zero. This is not a safe assumption. In general, uninitialized memory locations contain undefined values. I would suggest either:

model <- vectorfield(
  Csnippet("
            DVs = (beta-c*T-alpha)*Vs;
            DVb = alpha*Vs+(beta-c*T)*Vb;
            DT = (sigma*(Vs+Vb)-m)*T+p;
            DVs_r = DVb_r = Dbackground = 0;"
  )
)

which ensures that these state variables do not change their values. More elegantly (and efficiently), you could make these three quantities into model parameters (or globals, using the globals argument (see the documentation).

catherinebyrne commented 3 years ago

Your responses have been very helpful! I have the model working! Another issue has arisen though. So far I have been fitting to data capturing Vs and Vb over time (measured daily). However, I also have data capturing T, but this is not measured as often (every 4 days). Is there a way to include all the data when doing the fitting?

I tried to follow what was written in https://kingaa.github.io/pomp/FAQ.html under the heading "3.3 How do I deal with missing data?", but this doesn't seem to work. This is how I rewrote dmeasure:

dmeas <- Csnippet("
  if (ISNA(N_T)) {
    lik = dpois(N_Vs,Vs*rho1+1e-6,1)+
          dpois(N_Vb,Vb*rho1+1e-6,1);
     lik = (give_log) ? lik : exp(lik);
  } else if(ISNA(N_Vb)){
     lik = dpois(N_T,T*rho2+1e-6,1);
     lik = (give_log) ? lik : exp(lik);
  } else {
     lik = dpois(N_Vs,Vs*rho1+1e-6,1)+
             dpois(N_Vb,Vb*rho1+1e-6,1)+
             dpois(N_T,T*rho2+1e-6,1);
     lik = (give_log) ? lik : exp(lik);
  }
")

It seems like since there are NAs in some of the N_T data, it ignores N_T entirely.

I've attached an example of what the data looks like.

dat_with_nas.txt

kingaa commented 3 years ago

I don't immediately see anything wrong with your code. Do you see error messages? On what basis do you conclude that N_T is being entirely ignored?

kingaa commented 3 years ago

I also note that your data variables N_Vs and N_Vb are not integers, as your model assumes. What are these variables? Why have you selected a Poisson measurement model?

catherinebyrne commented 3 years ago

Yes, you are right that N_Vs and N_Vb are not integers. We are infecting mice with a luciferase-tagged virus that emits light, and putting these mice in a bioimager. The values represent the amount of light emitted by the virus in units of photons/s. For simplicity, I'm assuming that each virus emits 1 photon/s. There is also some background signal in this measurement which I am also trying to account for. Perhaps a gamma distribution would be better. I've definitely been having some issues with writing an appropriate dmeasure and rmeasure.

N_T represents CD8 T cell amounts.

I think the issue I was having with fitting is not only due to perhaps a poorly written dmeasure expression, but also due to the fact that the values of N_Vs/N_Vb and N_T are on vastly different scales. Because the viral load measures are so large, it leads to large errors between the data and model-predicted values that outweigh any error related to T cells. As a result, the model fits the viral data well, and the T cell data poorly. Could this be what's happening?

kingaa commented 3 years ago

The Poisson is a discrete distribution: It is defined only on the non-negative integers. I think it would be wise were you to give some more thought to the measurement model. For example: what are the errors you expect to see on these measurements? Do you think the error is better conceived of as additive or multiplicative?

I think the issue I was having with fitting is not only due to perhaps a poorly written dmeasure expression, but also due to the fact that the values of N_Vs/N_Vb and N_T are on vastly different scales. Because the viral load measures are so large, it leads to large errors between the data and model-predicted values that outweigh any error related to T cells. As a result, the model fits the viral data well, and the T cell data poorly.

The weight given to each distinct data stream is a function of the distribution of errors. If you are assuming that the error on the fluorescence values is small or large, then these will be given heavy or light weights, respectively. In other words, it is not the magnitudes per se but the scaling of the error relative to this magnitude (e.g., the coefficient of variation) that determines the relative weighting of the data streams.

Could this be what's happening?

It might, but I have no way of answering this question without more information from you.

kingaa commented 2 years ago

I'm closing this issue now, but feel free to re-open if more discussion is warranted.