kingaa / pomp

R package for statistical inference using partially observed Markov processes
https://kingaa.github.io/pomp
GNU General Public License v3.0

Guidelines for choosing rw.sd values #72

Closed MarieAugerMethe closed 5 years ago

MarieAugerMethe commented 6 years ago

I'm not sure how to choose the values for the rw.sd argument in the mif2 function. My understanding is that this choice might affect the performance of the iterated filtering. While I can choose some values arbitrarily, I was wondering whether there are any guidelines on the magnitude of the sd relative to the parameter value, and on when it is to our advantage to perturb values only before a certain time using ivp.

Many thanks!

Marie

kingaa commented 6 years ago

Good question, @MarieAugerMethe, and another excellent candidate for the FAQ! To explain, it's necessary to give some background on how the algorithm works. Even if you understand this already, the text here can serve as a rough draft of the resulting FAQ entry. As ever, your feedback will help not only you but other users too, by helping me improve the FAQ and other documentation.

IF2 (implemented as mif2 in pomp) essentially iterates a particle filter. It does this Nmif times. Crucially, however, it does this not using the exact model you've specified in your pomp object, which has parameters that are constant in time. Instead, it filters a model in which the time-constant parameters have been replaced by parameters that take a random walk with intensities given by the rw.sd argument. In other words, at each time-step, a random value (normally distributed in the current implementation) with mean zero and sd set by rw.sd is added to the parameter's value. As the iterations proceed, the intensities are gradually drawn down (according to the "cooling schedule" specified by cooling.type and cooling.fraction.50). The theory tells us that the swarm of particles will then concentrate around the MLE eventually, i.e., with sufficiently many iterations if the cooling is not too quick. All this is explained in the IF2 paper.
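For concreteness, here is a minimal sketch of the kind of call being described, in the pomp 1.x interface used in this thread. The pomp object po and the parameter names (beta, gamma, rho) are hypothetical, not from the discussion.

```r
## A minimal sketch of a typical mif2 call (pomp 1.x interface).
## `po` and the parameter names here are hypothetical.
library(pomp)

m1 <- mif2(
  po,                           # a pomp object (rprocess, dmeasure, data, ...)
  Nmif = 50,                    # number of IF2 iterations
  start = coef(po),             # starting guess, assuming po carries a parameter vector
  Np = 1000,                    # number of particles
  rw.sd = rw.sd(beta = 0.02, gamma = 0.02, rho = 0.02),
  cooling.type = "geometric",   # shape of the cooling schedule
  cooling.fraction.50 = 0.5     # perturbations shrink to 50% of initial after 50 iterations
)
```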

Some insight into the algorithm's effectiveness is afforded by considering that the effect of the added perturbations is to smooth the likelihood surface. This smoothing out of the local structure is a double-edged sword. On the one hand, it makes the large-scale structure of the surface plainer, which is useful when one is far from the MLE. On the other hand, it makes it impossible to find the exact MLE when it is close by.

The key thing to understand is that there are two parameter-space scales at work. First, there is the scale dictated by the distance between your starting guess and the MLE. This can be hard to know a priori, obviously. Second, there is the scale over which the log likelihood function changes appreciably. [Recall that a unit change in the log likelihood is considered "appreciable".] Again, this can be hard to know a priori, and it can be very different in different regions of parameter space. In particular, it can be quite different in the lowlands far from the MLE from what it is in the high country near the maximum of the likelihood. Moreover, there is a tension between these scales: if you choose the random-walk intensities too small, it will take more IF2 iterations to reach the MLE; if you choose them too large, the random-walk noise will obscure the fine structure of the likelihood surface.

So, back to your question: What to do? My usual practice is to follow a rule of thumb. Since parameters in the sorts of models I have worked with tend to be rates (or probabilities), it is plausible to imagine that multiplicative perturbations on the order of a few percent to the rates (or odds ratios) will lead to relatively small effects on the model behavior. Of course, this is famously not the case in general! Nevertheless, it suggests perturbing the rates (or odds ratios) on the log scale with random increments on the order of a few percent. The idea here is to err on the side of small perturbations, counting on cheap computing power to perform the IF2 iterations needed to achieve the MLE. At any rate, this is the reasoning behind the usual choice I make of setting rw.sd to a few percent (say 0.01–0.05), having introduced parameter transformations (log transforms for rates, logit transforms for probabilities) via the toEstimationScale and fromEstimationScale arguments of pomp. In my experience (which is of course shaped by the sorts of models and data I have worked on), IF2 rapidly climbs the likelihood surface, despite my intention of choosing a "small" rw.sd. In subsequent iterations, I often find it useful to reduce the random-walk intensities further, as attention shifts to the fine structure in the highlands of the likelihood surface.
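To make the transformation step concrete, here is a hedged sketch (not from the thread) of the log/logit transformations described above, via the pomp 1.x arguments mentioned. The parameter names are hypothetical: beta is a rate, rho is a probability.

```r
## A sketch of the parameter transformations described above (pomp 1.x).
## beta (a rate) is estimated on the log scale; rho (a probability) on the
## logit scale. `po` is a hypothetical existing pomp object.
library(pomp)

po <- pomp(
  po,
  toEstimationScale = function (params, ...) {
    params["beta"] <- log(params["beta"])    # rate -> log scale
    params["rho"]  <- qlogis(params["rho"])  # probability -> logit scale
    params
  },
  fromEstimationScale = function (params, ...) {
    params["beta"] <- exp(params["beta"])    # back to the natural scale
    params["rho"]  <- plogis(params["rho"])
    params
  }
)
```

With these in place, an rw.sd of 0.02 on the log scale corresponds to multiplicative perturbations of roughly 2% on the natural scale, which is the rule of thumb at work above.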

The second part of your question has to do with identifying the parameters for which one wants to modify the random perturbations. Commonly, one has some parameters that affect only the initial values (i.e., the values at t = t0) of the latent Markovian state process. Clearly, it is useless to perturb these parameters when t > t0. Indeed, it's worse than useless since the perturbations will impede the progress of such parameters toward their ML values. Declaring such parameters to be initial value parameters using ivp results in perturbations that are only applied at t = t0. More generally, there may be parameters for which perturbations should only be applied at certain times. The rw.sd function allows one to specify (via a simple R expression in terms of time t) which parameters fall into this category and when and how much they are to be perturbed.
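As a sketch of this (hypothetical parameter names and cutoff time, and assuming, as the help page describes, that the rw.sd expressions are evaluated with the observation time available as time):

```r
## A hedged sketch of time-dependent perturbations, to be passed as the
## rw.sd argument of mif2. X.0 is a hypothetical initial-value parameter;
## beta is perturbed only before a hypothetical cutoff time of 20.
rw.sd(
  X.0  = ivp(0.05),                   # an ivp: perturbed only at t = t0
  beta = ifelse(time < 20, 0.02, 0)   # perturbed only at observation times before t = 20
)
```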

I hope this helps. As I say, let's work toward refining this advice into a useful contribution to the FAQ!

MarieAugerMethe commented 6 years ago

Thank you @kingaa!

Again, what a fantastic answer. From the paper, I had a similar understanding of IF2, but your description clarifies many points for me, and I had completely missed the information on the scale of the parameters. Just a few things to confirm (both are perhaps slightly tangential, sorry).

This is more out of curiosity: my understanding was that the perturbation was the difference between IF2 and IF1, but mif also has rw.sd.

A more important question is whether we should set the values in rw.sd on the scale of the transformed or untransformed parameters (and, more generally, whether we should do the optimization on the transformed or untransformed scale). I think the default is transform = FALSE, but from your description above, it sounds like you do your optimization on the transformed scale. Is that what you would recommend (thinking out loud, I guess that would be the only way to constrain the parameters)? From the help file, it sounds like the values in rw.sd will be on the transformed scale if transform = TRUE, but what about the start values: should they be on the transformed scale too?

Thank you again!

Marie

kingaa commented 6 years ago

This is more out of curiosity: my understanding was that the perturbation was the difference between IF2 and IF1, but mif also has rw.sd.

No, both IF1 and IF2 use perturbations of the parameters, drawn down gradually. In fact, these are quite similar between the two. The chief differences between IF1 and IF2 are:

  1. In IF2, the full population of particles surviving to the end of each particle-filter iteration is passed to the next iteration. In IF1, a point estimate was derived using an approximation to the gradient of the log likelihood at the previous point. It turns out that the IF2 rule is both simpler and accurate to higher order! How often does that happen?
  2. The theoretical justification is entirely different. In the IF1 paper, it is shown that one can approximate the gradient of the log likelihood function and that hill-climbing moves based on this approximate gradient converge to the MLE under regularity conditions. By contrast, the IF2 paper introduces the notion of "Bayes maps" (i.e., mappings of the space of probability distributions over parameter space into itself corresponding to applications of Bayes' theorem) and shows that the iteration of these Bayes maps converges to a distribution concentrated around the MLE.

As for your second question: thank you again for bringing up an important point I had forgotten to mention. The parameter transformations take the parameters to and from the "estimation scale", i.e., the parameter space visible to the estimation algorithm, whatever it is. In any of the estimation algorithms in pomp, setting the transform argument to TRUE results in the estimation being carried out in this space. For mif2 in particular, the perturbations are additive (and normal) on this scale. Does that make sense?

However, it is confusing to keep track of two scales. For this reason, pomp allows you, the user, to interact with the parameters on the "natural" or "model" scale, i.e., the scale on which you've written the model. In particular, mif2 first transforms the start parameters onto the transformed scale (only if transform=TRUE, of course), does all its work there, including adding perturbations, etc., and then transforms the results back onto the natural scale.

To answer your question as directly as possible: the rw.sd values scale the random perturbations that are additive on the transformed scale. The parameters given in start (and all the outputs) are on the natural scale.
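For example (a hedged sketch with hypothetical parameter values, continuing the pomp 1.x interface and transformations sketched above):

```r
## start is supplied on the natural scale; with transform = TRUE the
## rw.sd values apply on the estimation scale defined by the
## transformations above. All parameter values here are hypothetical.
m1 <- mif2(
  po,
  Nmif = 50, Np = 1000,
  start = c(beta = 2, rho = 0.6, X.0 = 100),  # natural scale
  transform = TRUE,
  cooling.type = "geometric", cooling.fraction.50 = 0.5,
  rw.sd = rw.sd(beta = 0.02, rho = 0.02, X.0 = ivp(0.05))
)
coef(m1)  # estimates are reported back on the natural scale
```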

As an aside, it would be nice if pomp would give you a way of specifying the rw.sd values on the natural scale too, wouldn't it? But then how to transform them? A little reflection suggests that things get complicated quickly....

Also, you speculate that parameter transformation is the only method available to enforce constraints on the parameter space. It is the only method for which the user is given facilities, but there is no hindrance to using the barrier method, for example.

By the way, if you can suggest edits to make the help pages more transparent on these points, I would be grateful!

MarieAugerMethe commented 6 years ago

Ok, got it! Thanks @kingaa!

Here are some slight wording changes that I think simplify the explanation (at least for me) and hopefully are still true. I have bolded the changes. One thing I'm not sure how to make clearer is that the perturbation occurs at the time steps of the model (so if I'm modeling the population size of a species every year, you would get different values for the weights of the particles each year, right?), while the cooling occurs at the level of the iterations (right?). I tried my best, but maybe it's more complicated now? And actually, I'm not 100% sure whether the starting values of the new iterations are as I described...

As a note, in addition to adding this to the FAQ (which would be fantastic), I would maybe add a line in the getting started with pomp page, under the mif2 box, linking to that FAQ page.

IF2 (implemented as mif2 in pomp) essentially iterates a particle filter. It does this Nmif times. In a simple particle filter, which in partially observed Markov models is used to sample the underlying state (e.g., the real population size each year), the parameter values of the model stay constant in time. Crucially, however, IF2 filters a model in which the parameters change at each time step via a random walk with intensities given by the rw.sd argument. In other words, at each time step of the state time series, a random value (normally distributed in the current implementation) with mean zero and sd set by rw.sd is added to the parameter value of the previous time step. As mentioned above, the particle filter itself is iterated multiple times. At each iteration, the new particle filter uses as starting values the final parameter values of the previous iteration. As the iterations proceed, the intensities of the random walk are gradually drawn down (according to the "cooling schedule" specified by cooling.type and cooling.fraction.50). The theory tells us that the swarm of particles will then concentrate around the MLE eventually, i.e., with sufficiently many iterations if the cooling is not too quick. All this is explained in the IF2 paper.

Some insight into the algorithm's effectiveness is afforded by considering that the effect of the added perturbations is to smooth the likelihood surface. This smoothing out of the local structure is a double-edged sword. On the one hand, it makes the large-scale structure of the surface plainer, which is useful when one is far from the MLE. On the other hand, it makes it impossible to find the exact MLE when it is close by.

The key thing to understand is that there are two parameter-space scales at work. First, there is the scale dictated by the distance between your starting guess and the MLE. This can be hard to know a priori, obviously. Second, there is the scale over which the log likelihood function changes appreciably. [Recall that a unit change in the log likelihood is considered "appreciable".] Again, this can be hard to know a priori, and it can be very different in different regions of parameter space. In particular, it can be quite different in the lowlands far from the MLE from what it is in the high country near the maximum of the likelihood. Moreover, there is a tension between these scales: if you choose the random-walk intensities too small, it will take more IF2 iterations to reach the MLE; if you choose them too large, the random-walk noise will obscure the fine structure of the likelihood surface.

So what values should you use for the intensities? My usual practice is to follow a rule of thumb. Since parameters in the sorts of models I have worked with tend to be rates (or probabilities), it is plausible to imagine that multiplicative perturbations on the order of a few percent to the rates (or odds ratios) will lead to relatively small effects on the model behavior. Of course, this is famously not the case in general! Nevertheless, it suggests perturbing the rates (or odds ratios) on the log scale with random increments on the order of a few percent. The idea here is to err on the side of small perturbations, counting on cheap computing power to perform the IF2 iterations needed to achieve the MLE. At any rate, this is the reasoning behind the usual choice I make of setting rw.sd to a few percent (say 0.01–0.05). To be able to do this on the transformed scale of your parameters, you need to have introduced parameter transformations (log transforms for rates, logit transforms for probabilities) via the toEstimationScale and fromEstimationScale arguments of pomp. In the arguments of mif2, you can set transform = TRUE, and then the values in rw.sd will be on that transformed scale, but note that the values in start are still on the untransformed scale. In my experience (which is of course shaped by the sorts of models and data I have worked on), IF2 rapidly climbs the likelihood surface, despite my intention of choosing a "small" rw.sd. In subsequent iterations, I often find it useful to reduce the random-walk intensities further, as attention shifts to the fine structure in the highlands of the likelihood surface.

In addition, one has some parameters that affect only the initial values (i.e., the values at t = t0) of the latent Markovian state process. Clearly, it is useless to perturb these parameters when t > t0. Indeed, it's worse than useless since the perturbations will impede the progress of such parameters toward their ML values. Declaring such parameters to be initial value parameters using ivp results in perturbations that are only applied at t = t0. More generally, there may be parameters for which perturbations should only be applied at certain times. The rw.sd function allows one to specify (via a simple R expression in terms of time t) which parameters fall into this category and when and how much they are to be perturbed.

Thanks!!

Marie

MarieAugerMethe commented 6 years ago

~~FYI, I think maybe you said IF2 when you meant IF1 in your previous answer... Do you mean: "In IF1, a point estimate was derived using an approximation to the gradient of the log likelihood at the previous point"?~~ Corrected now.

kingaa commented 6 years ago

Yes, thanks for the correction. I am re-opening this issue to remind me to use it in crafting an FAQ entry.