Open · marksorel8 opened 5 years ago
FYI, @marksorel8, I'm currently investigating your reprex on the numerical errors in `M` under the `juv_trap_NB.stan` model. Diagnosing the problem is challenging b/c when the errors occur during sampling (as opposed to warmup) and `M` and/or `M_tot` is being monitored, the summary stats throw an error and no result is returned. Further complicating things, seemingly irrelevant changes to the model in an effort to circumvent this catch-22 evidently change the state of the RNG so that the errors go away. (Not entirely sure what's up w/ that, but I suspect it has to do with floating-point weirdness, which I first learned of thanks to a frustrating Stan situation a couple of years ago.) Even more baffling, simply removing `M` and `M_tot` from the list of parameters to be monitored causes the errors to go away. No idea what's going on there -- this seems to defy the basic programmatic logic of how Stan works.
From the info on `neg_binomial_rng`: "alpha / beta must be less than 2^29". Is it possible to extract all of the alphas and betas that the chain visited and see if alpha/beta exceeded this threshold in any iteration?
The actual errors printed in the Viewer pane are like this:

```
Exception: neg_binomial_rng: Random number that came from gamma distribution is 1.72119e+015, but must be less than 1.07374e+009 (in 'model2cf869f0741_juv_trap_NB' at line 103)
```
This refers to the method of generating pseudo-random NB variates using the gamma-Poisson mixture. AFAICT, either of the NB parameters (`M_hat * beta_NB` or `beta_NB`) could be responsible for these huge gamma RVs. We need a way to monitor these parameters when the errors are occurring to see if a pattern emerges in the "good" vs. "bad" iterations, and that in turn requires tricking the sampler as described above.
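To answer the extraction question above, here is a hypothetical sketch (nothing like this exists in the repo; the function name and array layout are my own). If the posterior draws of `M_hat` were pulled out of the fit (e.g. with `rstan::extract`), the documented threshold could be checked directly, because for `neg_binomial_rng(M_hat * beta_NB, beta_NB)` the ratio alpha / beta simplifies to `M_hat` itself:

```python
import numpy as np

STAN_LIMIT = 2.0**29  # documented bound on alpha / beta for neg_binomial_rng

def flag_bad_iterations(M_hat_draws):
    """Return indices of iterations where any M_hat exceeds Stan's bound.

    For neg_binomial_rng(M_hat * beta_NB, beta_NB), the ratio
    alpha / beta = (M_hat * beta_NB) / beta_NB simplifies to M_hat,
    so checking the documented threshold only requires the M_hat draws.
    """
    M = np.atleast_2d(M_hat_draws)  # rows = iterations, cols = trap obs
    return np.flatnonzero((M >= STAN_LIMIT).any(axis=1))
```

For example, `flag_bad_iterations([[1058.57, 1316.0], [1.31254e14, 1.31131e14]])` flags only the second iteration.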
Stay tuned...
Aha! I think at least part of the reason I'm having such a hard time reproducing your reprex is because you didn't set the RNG seed for R, thus the inits will be different from one call to the next. I must've gotten "lucky" the first time or two yesterday. Will just have to continue with trial-and-error (as in, keep trying until I get an error).
Oh man, I'm sorry @ebuhle. At first I only set the RNG in R and that didn't work, so I switched to setting it in Stan only, which I thought was enough. SMH. I think my R seed may have been 10403.
See https://github.com/marksorel8/Wenatchee-screw-traps/issues/3#issuecomment-548129741; I found one that "works". Better still, these seeds give two chains that are fine and one (chain 3) that appears to throw the error at the initial values and every iteration thereafter. Nothing obvious pops out from comparing the inits, though. Will have to mess around with a `print` statement in `generated quantities`.
An aside, but perhaps worth mentioning. Once upon a time, I had a JAGS model wherein a particular seed had the rather undesirable effect of allowing the MCMC chains to get X steps along (like 75%) before barfing every time at the same iteration. No reboot, update, downdate(?), etc. made a difference.
In this particular case, however, the behavior is much more pathological in that there are multiple pathways to the problem. Don't ask me how I know this, but setting really bad bounds on priors (eg, setting the lower bound on an SD prior to be much greater than the true value for the simulated data) can do wonders for the actual fitting process (eg, eliminates all divergent transitions, rapidly decreases run times), even if the answers are obviously wrong.
So, perhaps it's worth changing the prior bound(s) for `p_NB` (ie, the scale param for the NB) away from one or both of the current bounds to see if we can eliminate the problem? Right now the range on `p_NB` is [0,1]. Perhaps we should try [0.2,0.8] (or something else that's biased and more precise than the current range)?
> Once upon a time, I had a JAGS model wherein a particular seed had the rather undesirable effect of allowing the MCMC chains to get X steps along (like 75%) before barfing every time at the same iteration. No reboot, update, downdate(?), etc. made a difference.

Haha, even I remember this; that's how annoying it was!

> So, perhaps it's worth changing the prior bound(s) for `p_NB` (ie, the scale param for the NB) away from one or both of the current bounds to see if we can eliminate the problem? Right now the range on `p_NB` is [0,1].
Not a bad idea, although I'd suggest something without the un-Gelmanian hard endpoints inside the feasible domain, like Beta(2,2). I do have the sneaking suspicion that `p_NB` is the culprit, but I was hoping to get some more decisive evidence. I didn't think it would be this difficult, though.
OK, so a bit of `print`-statement sleuthing suggests that the problem may be in both `M_hat` and `beta_NB`.
Typical output in cases that produce the error:

```
Chain 3: M_hat[1] = 1.31254e+014  beta_NB = 9.99201e-016  M_hat[2] = 1.31131e+014  beta_NB = 9.99201e-016
Chain 3: Exception: neg_binomial_rng: Random number that came from gamma distribution is 3.83462e+011, but must be less than 1.07374e+009 (in 'model2a1471a922e3_juv_trap_NB' at line 102)
```
Compare that to, for example:

```
M_hat[53] = 1058.57  beta_NB = 0.311853  M_hat[54] = 1047.58  beta_NB = 0.311853  M_hat[55] = 1316  beta_NB = 0.311853  M_hat[56] = 312.761  beta_NB = 0.311853
```
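As a sanity check (my own arithmetic, using only the numbers printed above), plugging those logged values into the gamma-Poisson recipe shows why chain 3 trips the bound: the mixing gamma variate has mean alpha / beta = `M_hat`, which for the bad iterations sits roughly five orders of magnitude above the ~1.07e9 cutoff quoted in the exception:

```python
CUTOFF = 1.07374e9  # bound quoted in the neg_binomial_rng exception message

# (M_hat, beta_NB) pairs copied from the print output above
for label, M_hat, beta_NB in [("bad", 1.31254e14, 9.99201e-16),
                              ("good", 1058.57, 0.311853)]:
    alpha = M_hat * beta_NB        # shape of the mixing gamma
    gamma_mean = alpha / beta_NB   # alpha / beta simplifies to M_hat
    print(label, gamma_mean, gamma_mean > CUTOFF)
```

The "bad" row evaluates to True (mean far beyond the cutoff, so the gamma draws get rejected) and the "good" row to False.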
It's too bad the "canonical" priors that would regularize `p_NB` away from 0 and 1 (like the aforementioned Beta(a,b) where a and b are slightly greater than 1) are also fairly informative about the bulk. We could use something like the Subbotin distribution, as we do in salmonIPM, but it's hacky.
It's less obvious how to prevent `M_hat` from blowing up, but maybe tighter priors on the log-scale regression coefs could help.
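A back-of-envelope illustration of why that might work (the prior SDs below are made up, not the model's): since `M_hat` is the exponential of a log-scale linear predictor, a coefficient drawn a few SDs out under a wide prior already yields astronomical means, while a tight prior keeps them in a plausible range:

```python
import math

# If log(M_hat) includes a coefficient with prior normal(0, sd), a draw
# 3 SDs out gives a multiplicative factor of exp(3 * sd) on M_hat.
for sd in (1.0, 5.0, 10.0):
    print(sd, math.exp(3 * sd))  # ~20 at sd = 1, ~1e13 at sd = 10
```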
Anyway, I'm just leaving this update here for now; I'm out of time to spend on this today.
Thank you so so much for looking into this @ebuhle. I will try some different priors and report back!
I leave the exercise to the reader...errr, dissertator.
The tricky part will be making sure you're using a parameterization of the NB that is closed under addition, so that this still holds:
https://github.com/marksorel8/Wenatchee-screw-traps/blob/14e67d4208d652a46a814142b9fe7004d200f711/src/Stan_demo/juv_trap_multiyear.stan#L75
As you probably know, there are several parameterizations in common use, and even the "standard" one in Stan differs from the one on Wikipedia, for example. Further, the most common and useful parameterization in ecology (mean and overdispersion) is yet another variant. The catch is that closure under addition requires that all the NB RVs being summed have the same p (in the standard parameterization), so you can't simply sum the expectations like we do with the Poisson.
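A sketch of the bookkeeping this would involve (the function names and the NB2 convention var = mu + mu^2/phi are my assumptions; the closure fact itself, that independent NB(r_i, p) variates with a common p sum to NB(sum r_i, p), is standard): convert mean/overdispersion to the standard (r, p) form, sum only when p is shared, and convert back:

```python
def mu_phi_to_r_p(mu, phi):
    # Mean/overdispersion (NB2: var = mu + mu^2 / phi) -> standard (r, p):
    # r = phi, p = phi / (mu + phi).
    return phi, phi / (mu + phi)

def r_p_to_mu(r, p):
    # Standard (r, p) back to the mean: mu = r * (1 - p) / p.
    return r * (1.0 - p) / p

def sum_nb(params, tol=1e-12):
    # Closure under addition holds only when every summand shares the same p:
    # then sum_i NB(r_i, p) ~ NB(sum_i r_i, p). Otherwise the total isn't NB,
    # which is why you can't just sum expectations as with the Poisson.
    rs, ps = zip(*params)
    if max(ps) - min(ps) > tol:
        raise ValueError("summands must share p for the total to be NB")
    return sum(rs), ps[0]

# A shared p requires phi_i proportional to mu_i: e.g. (mu, phi) = (10, 5)
# and (20, 10) both give p = 1/3, and the total mean recovers 10 + 20 = 30.
total = sum_nb([mu_phi_to_r_p(10.0, 5.0), mu_phi_to_r_p(20.0, 10.0)])
print(total, r_p_to_mu(*total))
```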
Then it will also be slightly tricky to figure out the binomial thinning:
https://github.com/marksorel8/Wenatchee-screw-traps/blob/14e67d4208d652a46a814142b9fe7004d200f711/src/Stan_demo/juv_trap_multiyear.stan#L76
since thinning the NB (like the Poisson) scales the expectation. So you'll have to switch between parameterizations (i.e., solve one set of parameters for the other) at least once.
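The thinning arithmetic is friendliest in Stan's (alpha, beta) parameterization; a sketch under that assumption, derived from the gamma-Poisson mixture rather than anything in the repo:

```python
def thin_nb(alpha, beta, q):
    # Binomial thinning (each count retained with prob q) of NB(alpha, beta)
    # in Stan's parameterization: the mixing lambda ~ Gamma(alpha, beta)
    # is scaled by q, i.e. q * lambda ~ Gamma(alpha, beta / q), so the
    # thinned count is NB(alpha, beta / q) and the mean scales from
    # alpha / beta to q * alpha / beta, as stated above.
    return alpha, beta / q

a, b = thin_nb(10.0, 2.0, 0.5)
print(a / b)  # mean after thinning: 2.5 == 0.5 * (10 / 2)
```

Note that alpha is unchanged but beta is not, so moving between this form, the standard (r, p), and mean/overdispersion forms is where the parameter-solving mentioned above comes in.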
I would have to sit down with pencil and paper to figure this out. Sounds like a good grad student project, amirite??
Originally posted by @ebuhle in https://github.com/marksorel8/Wenatchee-screw-traps/issues/1#issuecomment-523688629