nmfs-ost / ss3-source-code

The source code for Stock Synthesis (SS3).
https://nmfs-ost.github.io/ss3-website/
Creative Commons Zero v1.0 Universal

[Bug]: Possible bug relating to the estimation of Q_extra_SD when doing NUTS MCMC #592

Closed alexpbell closed 3 months ago

alexpbell commented 4 months ago

Describe the bug

Inability to find a suitable epsilon in the hybrid Monte Carlo epsilon search when attempting to use NUTS MCMC. SS3 input files have been provided to @Rick-Methot-NOAA.

To Reproduce

ss3_win.exe -hbf -nuts -mcmc 15

Expected behavior

MCMC iterations to proceed

Screenshots

Screenshot for the command above: [image]

In this screenshot, -verbose_find_epsilon has been added to the command to show more detail: [image]

Which OS are you seeing the problem on?

Windows

Which version of SS3 are you seeing the problem on?

3.30.22.1

Additional Context

  1. This issue has not come up for this model when Q_extra_SD is not an estimated parameter, but it is not clear whether the problem is exclusive to Q_extra_SD.
  2. The -verbose_find_epsilon output suggests the issue occurs on the first leapfrog jump, where epsilon is at its default value of 1.
  3. In hmc_functions.cpp (ADMB source) there is this check after the first jump: "if(alpha < 0.5 || std::isnan(alpha)){". Here it seems that we have alpha = Inf rather than NaN, because in this case the likelihood is getting vastly better rather than worse (as you might expect).
  4. I am not sure why the likelihood is getting so much better, whether that is meant to be possible, whether upgrading this check to include std::isinf (sketched below) might resolve things, or whether some deeper issue with the input files is causing this.
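
For what it's worth, a minimal sketch of the upgraded check suggested in point 4 (reject_epsilon is a hypothetical standalone wrapper, not the actual ADMB function; in ADMB the condition sits inline in the epsilon search):

```cpp
#include <cmath>

// Hypothetical standalone version of the guard in hmc_functions.cpp.
// The only change from the quoted check is the added std::isinf test,
// so that alpha = Inf is also treated as a failed trial epsilon.
bool reject_epsilon(double alpha) {
  return alpha < 0.5 || std::isnan(alpha) || std::isinf(alpha);
}
```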
alexpbell commented 4 months ago

Tested now on a Linux platform (Ubuntu 22.04.4 LTS, Intel processor) and hit the same issue.

Rick-Methot-NOAA commented 4 months ago

@kellijohnson-NOAA and all associated with hake. Do you see this happening in your situation?

alexpbell commented 4 months ago

I think what is happening is this: when extra s.e. is being estimated for Q and hbf (hybrid bounded flag) is 1, the initiation of the hybrid Monte Carlo step size search with epsilon = 1 makes a big jump through "unbounded" (or at least differently bounded) parameter space, such that the s.d. term Svy_se_use(f,i) in the surv_like component of the likelihood becomes very large: Svy_like_I(f,i) = 0.5 * square((Svy_obs_log(f,i) - Svy_est(f,i) + 0.5 * square(Svy_se_use(f,i))) / Svy_se_use(f,i)) + sd_offset * log(Svy_se_use(f,i)); so the NLL becomes unexpectedly small ("good"). The fact that a positive definite Hessian minimum is still found suggests the NLL objective surface is only locally convex around that minimum, not globally convex, and that some kind of bounds, or variation to the bounds when hbf is on, is needed for this kind of "extra s.e." parameter.
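
To see the mechanism numerically, here is a minimal standalone sketch of the quoted term (not the SS3 source; the residual value is made up and sd_offset = 0 is an assumption about this configuration). A moderate jump in the s.e. collapses the contribution from roughly 200 to roughly 2:

```cpp
#include <cmath>
#include <cstdio>

// Standalone re-implementation of the quoted survey likelihood term.
// resid stands in for Svy_obs_log - Svy_est; se for Svy_se_use.
double svy_like(double resid, double se, double sd_offset) {
  double z = (resid + 0.5 * se * se) / se;
  return 0.5 * z * z + sd_offset * std::log(se);
}

int main() {
  double resid = 2.0;  // hypothetical log-scale residual
  for (double se : {0.1, 0.5, 2.0, 5.0}) {
    // With sd_offset = 0 there is no log(se) penalty, so inflating se
    // makes the squared residual term, and hence the NLL, much smaller.
    std::printf("se = %4.1f  nll term = %8.2f\n", se, svy_like(resid, se, 0.0));
  }
  return 0;
}
```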

Cole-Monnahan-NOAA commented 4 months ago

The first thing I notice is that the NLL after optimization (374) does not match the initial NLL during NUTS (31415). If you turn on SS3 printing to see the SSB and other quantities, it would probably indicate that you're starting in some unrealistic part of the posterior, potentially with a crashed population. Why this occurs I'm not sure, and I don't remember seeing this before for other models (but this is precisely why I print it out in ADMB). Does the population get very close to 0 at some point? Do you use any dev_vectors? If so I'd turn those off.

I would try a few tests (example command lines below):

  1. Set it to start from the .par file and use phasing and max phase to turn off most parameters, but leave estimation of Q_extra_SD on. See how far back you have to strip the model for it to work fine.
  2. Try adding the flag -mcdiag to the command. This will tell ADMB to start from a diagonal mass matrix (ignoring the admodel.cov file).
  3. Try the old RWM algorithm to see what it does. If the initial NLL matches the optimizer then that's really helpful to know.
  4. Try specifying initial values (active pars only) using e.g. -mcpin init.txt.
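For reference, tests (2) through (4) as command lines might look like this (flags as in ADMB; init.txt is a hypothetical file of initial values for the active parameters):

```
ss3_win.exe -hbf -nuts -mcmc 15 -mcdiag           # (2) diagonal mass matrix
ss3_win.exe -mcmc 15                              # (3) old RWM algorithm
ss3_win.exe -hbf -nuts -mcmc 15 -mcpin init.txt   # (4) user-supplied initial values
```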

My intuition is that this is a model configuration/SS3 issue. Catching serious issues like this in the initial step size calculations would probably be prudent. If need be we can file an issue at the ADMB repo, but for now I don't recommend it: ADMB is no longer actively developed, this is a pretty extreme case, and the error message and console output are sufficient to identify where to look for issues.

alexpbell commented 3 months ago

Thanks Cole, it seems you've hit on the main underlying issue. I did not expect the NLL to jump like that at the onset of MCMC. The reason appears to be related to dev vectors, as you suspected, and the best documentation of the issue seems to be the starter file section of the SS3 manual:

"A bias adjustment of 1.0 is applied to recruitment deviations in the MCMC phase, which could result in reduced recruitment estimates relative to the MLE when a lower bias adjustment value is applied."

In this case we have 45 rec devs estimated, a strong bias adjustment ramp applied, and generally low biomass. The combination produced a strong "bias shock", reducing relative recruitment at the onset of MCMC such that crash penalties kicked in.
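
As a rough sketch of the size of that shock, using the standard SS3 lognormal bias adjustment form (the sigma_R and ramp values here are illustrative):

$$R_y = R_0 \, e^{\tilde r_y - b_y \sigma_R^2/2}, \qquad \frac{R_y^{\text{MCMC}}}{R_y^{\text{MLE}}} = e^{-(1-b_y)\,\sigma_R^2/2},$$

so with sigma_R = 1 and a ramp value of b_y = 0.8, switching to b_y = 1 at the start of MCMC scales that year's recruitment by exp(-0.1), about 0.90, and years early in the ramp (small b_y) drop considerably more.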

I am finding that the SS3 manual's suggestion for handling this ("A small value, called the "bump", is added to the ln(R0) for the first call to MCMC in order to prevent the stock from hitting the lower bounds when switching from MLE to MCMC.") still leaves quite a difference between the MLE NLL and the MCMC NLL when using NUTS (whereas it doesn't seem to with RWM). So I am handling the bias shock by setting the "Max Bias Adjustment" value in the rec dev settings to -1, which sets the bias adjustment to 1.0 for all years. I am not using this for any MLE inference, just to prime the NUTS MCMC; let me know if you think there is a better way to go. With this done, NUTS appears to be operating correctly with Q extra SD as an estimated parameter, though I still have more testing to do to be sure.
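
For anyone following along, this setting lives in the advanced recruitment-deviation options of the control file; an illustrative excerpt (years made up, comment text as in recent control files):

```
1975 #_last_yr_fullbias_adj_in_MPD
2020 #_first_recent_yr_nobias_adj_in_MPD
  -1 #_max_bias_adj_in_MPD (-1 to override ramp & set biasadj=1.0 for all estimated recdevs)
```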

RWM NLL difference due to a small ln(R0) "bump": [image]

NUTS NLL difference with the same bump: [image]

NUTS NLL difference with no bump: [image]

alexpbell commented 3 months ago

I should clarify that the above screenshots all relate to Max Bias Adjustment = -1 models, so they deal with "other" (non rec dev) bias adjustments. They are therefore not testing whether I can find a suitable bump that turns crash penalties off (crash is zero in all these cases); I am just noting that small changes in ln(R0) seem to make a big difference to the likelihood components in the hbf/NUTS case, so I'm staying away from the bump option for now:

[image]

Cole-Monnahan-NOAA commented 3 months ago

Note that the dev_vector option for recdevs and the bias ramp are different concepts.

I don't understand why the MLE ramp would be different since SS3 detects the -mcmc flag internally and disables it. The differences could be from using dev_vectors in other model components. These are not compatible with MCMC and should not be used. See the manual for how to change this option.
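
For reference, the option in question is the do_recdev line of the control file; an illustrative excerpt (comment text as in recent control files), with 2 selecting simple deviations rather than the dev_vector option 1:

```
2 #do_recdev: 0=none; 1=devvector (R=F(SSB)+dev); 2=deviations (R=F(SSB)+dev); 3=deviations (R=R0*dev; dev2=R-f(SSB))
```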

The NUTS code is robust to initial values so as long as it can find an epsilon and get started then all should be good. It just failed this time b/c the crashed portion of the likelihood space is not reliable. You don't need to (and ideally shouldn't) start from the MLE. So don't worry about those small differences, as long as they're not caused by dev_vectors.

iantaylor-NOAA commented 3 months ago

I had forgotten about the bias adjustment bump option. It was created long before Cole developed adnuts, in response to models that crashed when the -mcmc flag replaced the bias adjustment ramp with 100% bias adjustment for all recdevs. Perhaps it was around the time of this Stewart et al. paper: https://doi.org/10.1016/j.fishres.2012.07.003.

I think running the initial estimation with Max Bias Adjustment = -1 is a very reasonable solution; it would only NOT work well if the resulting parameter estimates and associated correlation structure led to worse mixing than the MLE without that change. I see Cole's response just now; yes, adnuts is so efficient that I think poor mixing is unlikely to be a problem.

Cole-Monnahan-NOAA commented 3 months ago

@iantaylor-NOAA yeah thanks. I don't see that option being super useful b/c the user should just optimize with the -mcmc flag, or turn bias correction off completely, so it would not be needed.

@alexpbell I also recommend running from R through the adnuts R package. Command line is fine for diagnosing issues, but for real runs you can't pass up parallel execution and the tie-in with rstan diagnostics and other outputs.

Wasn't there an overhaul of the SS3 manual to include some of this advice? I can't remember where we left off with that.

Rick-Methot-NOAA commented 3 months ago

dev_vector is only used for recruitment deviations. All other parameter deviation vectors are simple devs. Recruitment uses dev_vector because, at the time (a very long time ago), it was deemed desirable that the mean recruitment derived from the spawner-recruitment curve's log(R0) be the same as the mean recruitment from the time series of recruitments. That gets broken easily when using regular dev vectors, and SS3 MAKES NO INTERNAL ADJUSTMENT for situations in which log(R0) moves one direction while the mean of the recdevs moves the other direction as an offset. MSY-like quantities are based on the spawner-recruitment parameters, not the mean of the recruitments after devs are applied. Note that the SUM of the recdevs is reported as a stepping stone towards such an adjustment.
I think this internal adjustment to base MSY on the arithmetic mean of the derived recruitments should be a priority for SS3 development, because we are already advocating that users switch away from the dev_vector option.
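
A minimal sketch of the identity involved (bias adjustment ignored for simplicity): dev_vector constrains the deviations to sum to zero, so for recruitments of the form R_y = R_0 exp(r_y),

$$\sum_y \tilde r_y = 0 \;\Rightarrow\; \frac{1}{N}\sum_y \ln R_y = \ln R_0 + \frac{1}{N}\sum_y \tilde r_y = \ln R_0,$$

i.e., the geometric mean of the realized recruitments equals R0 from the spawner-recruitment curve. Simple devs drop the zero-sum constraint, so the mean realized recruitment can drift away from R0 while MSY quantities stay keyed to it.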