Prior on the Dirichlet-Multinomial parameter?

iantaylor-NOAA commented 4 years ago

Thanks to @aaronmberger-nwfsc for reminding me to follow up on suggestion 3 in issue #590: "Prior on the Dirichlet-Multinomial parameter". I'm adding further details in this separate issue to keep things tidier, as the discussion under #590 is mostly about the zero-centered recruitment deviations.

A prior was explored to deal with the issue described starting on page 52 of the 2019 assessment report:

Previous MCMC explorations indicated that log(theta_surv), the log of the theta parameter associated with the survey data in the Dirichlet-Multinomial (D-M) weighting, was not being sampled efficiently due to many samples occurring in a part of the parameter space where the effective sample size multiplier, theta_surv/(1+theta_surv), is between 0.99 and 1.0. In this area, the input sample sizes are given full weight and the likelihood surface is almost completely flat with respect to this parameter. Therefore, to improveMCMC convergence, log(theta_surv) was fixed at the MLE estimate of 2.44, corresponding to a weight of theta_surv/(1+theta_surv) = 0.92.

@James-Thorson suggested that a prior on the log(theta) which would be associated with an approximately uniform weight. The 2019 model had no prior on the log(theta) parameters (those with labels "ln(EffN_mult)_1" and "ln(EffN_mult)_2"). These parameters are transformed as shown in Equation (7) of the 2019 hake assessment report: The transformation causes a uniform distribution in the log(theta) space to correspond to a U-shaped distribution in the weight (multiplier) space of theta / (1 + theta) which could be counteracted by a prior.

To develop a proposed prior, I started with a uniform distribution for weight = theta / (1 + theta) ~ U(0,1) and transformed to get a distribution of log(theta) values. This distribution had standard deviation 1.813. A normal prior with that same SD: N(0, 1.813) applied to the log(theta) parameter corresponds to a distributions of weights that is not quite uniform, but much closer: In that figure, the first panel is based on a log(theta) parameter bounded at -5 and 5. An upper bound of 20 results in an even more skewed distribution with lots of weight close to 1.0.

Tests of MCMC using the N(0, 1.813) prior showed almost no impact on the weight associated with the fishery ages and a small reduction in the weight for the survey ages while significantly reducing autocorrelation of the parameter.

Here's a summary of the MLE and MCMC results

MCMC results for median weight for fishery ages has almost zero change: median = 36.63% of input sample size without prior to 36.55% with prior
MLE estimate of weight given to survey ages decreases from 92.0% without prior to 88.1% with prior but impact on model is very small (0.3% change in 2019 biomass)
MCMC results for weight given to survey ages with prior has median = 88.4%, so almost the same as the MLE with prior.
MCMC autocorrelation for the parameter associated with survey age weights is -0.01 with prior compared to 0.15 without.

The only negative that I can see is that the autocorrelation of the M and log(R0) parameters, which was already a bit high, has gone up a little higher, 0.27 to 0.40 for log(R0) based on the first 1554 samples (after same thin and burn-in as 2019 base model). A longer burn-in might help with this as would switching to the ADNuts MCMC approach. Also, estimating the parameter in your 2020 candidate models with and without this prior might show different behavior.

The model, including incomplete MCMC is uploaded here: https://drive.google.com/drive/folders/1ZVveMDA79HKs1u-JTgIC9vQspTOeRv98 I'll update with complete 24-million MCMC results tomorrow morning if not later tonight.

iantaylor-NOAA commented 4 years ago

Notes attached related to figuring out the proposed prior and exploring its impact on the model. Dirichlet-Multinomial_prior_notes.R.txt

iantaylor-NOAA commented 4 years ago

I just uploaded the full 2000 sample results from the 24-million chain into the "mcmc" subfolder within that /models/explorations_from_Ian/2019.03.00_base_normal_DM_prior_1.813/ folder on Google Drive and updated the attached R script Dirichlet-Multinomial_prior_notes.R.txt

There is little change from the incomplete mcmc run summarized above except the autocorrelation in M and log(R0) has gone down a bit, suggesting that there were more issues earlier in the chain than in the later part. Both of those parameters are uncorrelated with the ln(EffN_mult)_2 parameter, so I'm not sure why estimating it would slow down the mcmc mixing.

aaronmberger-nwfsc commented 4 years ago

run using 2019 model has autocorrelation in M at 0.094 and ln(R0) at 0.154, so not too bad

kellijohnson-NOAA commented 4 years ago

I did a small simulation with the base model that uses the prior on the DM parameters. If you multiply the input sample size of the fishery age composition data by 2, 1, or 0.5 in the EM, here are the results of the 2 DM parameters with Fishery on the x axis. Looks like we can properly downweight but b/c the model is limited to never upweight with the DM parameters you can't get back to where you started with an input N lower than the truth. I think we need to think heavily about our input sample sizes for the fishery in relation to what we use as sample sizes for the survey. How do we think those values compare?

hake-assessment_basemodel_inputn_DMpars

aaronmberger-nwfsc commented 4 years ago

In general, it seems like the fishery and survey comps tell us the same thing after about age-3, but the survey particularly helps with age-1 and age-2 (even if a year or two later than the survey itself), because of the ascending limb for fishery selectivity.

So in your figure, the multiple input sample size by 1 = the base model, correct? I don't see the base model set of parameters in the figure (i.e., survey transformed value of ~0.9 with fishery transformed value of ~0.45. Are the transformations calculated as: exp(theta)/(1+exp(theta))?

And just editing this to add that of course it depends on your definition of theta, so using SS parameter output names: exp(ln(EffN_mult)) / (1 + exp(ln(EffN_mult))) :)

James-Thorson commented 4 years ago

I agree that it is important when estimating time-varying parameters to properly measure the input sample size, such that you never weight data more than the fundamental limits imposed by sampling variability.

There's three main ways to measure input sample size.

Bootstrap (Stewart-Hamel 2014)
Normal approximation to the standard design-based calculation (Thorson 2014)
Spatio-temporal estimates of comps (Thorson and Haltuch 2018)

Please tell me how I could help!

On Fri, Jan 24, 2020 at 8:54 AM Aaron Berger notifications@github.com wrote:

In general, it seems like the fishery and survey comps tell us the same thing after about age-3, but the survey particularly helps with age-1 and age-2 (even if a year or two later than the survey itself), because of the ascending limb for fishery selectivity.

So in your figure, the multiple input sample size by 1 = the base model, correct? I don't see the base model set of parameters in the figure (i.e., survey transformed value of ~0.9 with fishery transformed value of ~0.45. Are the transformations calculated as: exp(theta)/(1+exp(theta))?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pacific-hake/hake-assessment/issues/595?email_source=notifications&email_token=AB46UTLBAQ57RNV5W54SU2DQ7MMKLA5CNFSM4KHLJETKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJ3NAGI#issuecomment-578211865, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB46UTNMNC2MCVHUNF7A67LQ7MMKLANCNFSM4KHLJETA .

kellijohnson-NOAA commented 4 years ago

@aaronmberger-nwfsc The simulated data would not have a DM parameter b/c it is assumed that they come from the multinomial distribution characterized by the input sample size of the OM and the expected proportions estimated from that model. This is because I used the SS bootstrap simulator.

aaronmberger-nwfsc commented 4 years ago

thanks Kelli for the clarification - I didn't realize you had an operating model, great! I originally thought this was a 'run-based' simulation/comparison.

kellijohnson-NOAA commented 4 years ago

@James-Thorson Thanks for listing the papers, the second one in particular, I don't think I have read that one yet. As of right now, we are leaning more towards letting the on-the-ground sampling protocols inform input sample size. For example, instead of assuming that hauls have sampled the same number of ages throughout time, we are actually trying to go through sampling protocol manuals and learning that the number of ages collected from a haul has changed over time. Once we sort that out, then we will probably get into more model-based estimates.

Just for fun, I also did the same with two different operating models that have 2dAR selectivity, where the three digit numbers in the labels are the rho parameters for age and time, e.g., 000075 is rho_age of 0.00 and rho_time of 0.75. Where the results appears to be a smaller effective sample size for the fishery when the input sample size is too high. hake-assessment_basemodel_inputn_DMpars

James-Thorson commented 4 years ago

I'm not sure I'm following. The effective sample size is approx. n_input * Theta / (1+Theta). It looks like that product is similar for Nmult = 2 and Nmult=1, indicating that if you erronouesly inflate n_input then the theta goes down to compensate...?

On Fri, Jan 24, 2020 at 9:12 AM Kelli Johnson notifications@github.com wrote:

@James-Thorson https://github.com/James-Thorson Thanks for listing the papers, the second one in particular, I don't think I have read that one yet. As of right now, we are leaning more towards letting the on-the-ground sampling protocols inform input sample size. For example, instead of assuming that hauls have sampled the same number of ages throughout time, we are actually trying to go through sampling protocol manuals and learning that the number of ages collected from a haul has changed over time. Once we sort that out, then we will probably get into more model-based estimates.

Just for fun, I also did the same with two different operating models that have 2dAR selectivity, where the three digit numbers in the labels are the rho parameters for age and time, e.g., 000075 is rho_age of 0.00 and rho_time of 0.75. Where the results appears to be a smaller effective sample size for the fishery when the input sample size is too high. [image: hake-assessment_basemodel_inputn_DMpars] https://user-images.githubusercontent.com/4108564/73088647-80c42000-3e89-11ea-8f11-642652d15aed.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pacific-hake/hake-assessment/issues/595?email_source=notifications&email_token=AB46UTOFD73ZZLXA3MQTNELQ7MOQXA5CNFSM4KHLJETKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJ3OYAY#issuecomment-578219011, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB46UTNYEQMWFLOVBEVCDCDQ7MOQXANCNFSM4KHLJETA .

iantaylor-NOAA commented 4 years ago

Hi @kellijohnson-NOAA Nice work doing this analysis. It raises some good questions about input sample size, but it seems you also were interested in the impact of the prior. How hard would it be to repeat this without the prior?

Regarding the results of your second figure, adding the rho_age > 0 and rho_time > 0 presumably adds stiffness to the time-varying selectivity causing the fishery age comps to be fit less well. So it makes sense to me that the D-M parameters would lead to lower weights. However, it appears that the lower weights are only estimated when you've initially up-weighted the samples, but not for Input N Mult = 1 or 0.5.

I'm guessing that for a given data set, the results for the three Input N Mult will be very similar for these results. How hard would it be to come up with some metric of error in 2020 Spawning Biomass or other quantity of interest for these results?

Lastly, @aaronmberger-nwfsc, I'll work on getting the parameter label to be renamed from "ln(EffN_mult)" to "log(theta)"

Recent post from @James-Thorson: yes, that's my interpretation too, but the degree of going down differs by case and doesn't seem to be exactly the inverse of the multiplier.

kellijohnson-NOAA commented 4 years ago

@iantaylor-NOAA Thanks for the thoughts.

I could redo it without the prior, quite easily. I am not sure if it is worthwhile for hake, given that we are going forward with the prior. What do others think?
I am thinking that there must be something else going besides your initial thoughts b/c the fits to the fishery composition data in the "real" assessment have a maximum residual half the size of the base model when the 2dAR with rho parameters were used.
I use ss3sim to read in the results, so I can basically plot anything you want quite easily
@James-Thorson - you are correct in that I just erroneously inflated and deflated the sample sizes of the fishery age compositions

iantaylor-NOAA commented 4 years ago

Thanks @kellijohnson-NOAA. That's a lot of variability and doesn't seem like much bias in any of the cases (although it's hard to tell where the medians might be from that figure).

The prior would have more influence on a weight that would be estimated close to 1.0 than one that is not, and MLE estimation confirms this in showing that the impact is slightly bigger on the fishery than the survey ages. So I don't see a need for further exploration if you're all comfortable moving forward with the prior.

I clearly don't really understand the impact of the autocorrelation in the semi-parametric selectivity. If the selectivity deviations are indeed autocorrelated, then it would appear that the change in likelihood of similar deviations adjacent in age and time will be smaller when you have rho_age > 0 and rho_time > 0 than if they're fixed at 0. So that would allow you to have more flexibility to deviate from the baseline selectivity overall (but presumably less flexibility to change between years due to the assumed autocorrelation).

Thanks @kellijohnson-NOAA and the JTC for doing these explorations with the Dirichlet-Multinomial likelihood and the semi-parametric selectivity. And thanks @James-Thorson for coming up with them in the first place.

cgrandin commented 4 years ago

Thanks for starting work on this @kellijohnson-NOAA. It will be very interesting to see what happens modeling changes in historic sampling.

Do you want to include some of this work in the assessment, or keep as part of the assessment presentation?

kellijohnson-NOAA commented 4 years ago

I was more thinking presentation rather than stuff for the actual assessment, mainly to inform SRG requests for next year.

andrew-edwards commented 4 years ago

Chris to tease out parameters for parameter tables 17 and 27 (and more?).

andrew-edwards commented 4 years ago

So we actually did estimate $\theta_{fish}$ in 2019, yes? The text just after the chunk that Ian quoted at the top of this Issue (p53 2019 assessment), says:

So the new bit this year is that we have a better prior for $\theta_{surv}$, yes? Last year we used a uniform (-5, 20) prior for both.

andrew-edwards commented 4 years ago

I just added some new text (pasted below). @iantaylor-NOAA and @James-Thorson - can you please confirm if you are okay with 'pers. comm.'. @kellijohnson-NOAA @aaronmberger-nwfsc @cgrandin - please take a read over at some point (at least one of you!). It's in the Data section.

James-Thorson commented 4 years ago

Looks good to me! Thanks for documenting the rationale for this choice

On Tue, Feb 4, 2020, 6:23 PM Andrew Edwards notifications@github.com wrote:

I just some new text (pasted below). @iantaylor-NOAA https://github.com/iantaylor-NOAA and @James-Thorson https://github.com/James-Thorson - can you please confirm if you are okay with 'pers. comm.'. @kellijohnson-NOAA https://github.com/kellijohnson-NOAA @aaronmberger-nwfsc https://github.com/aaronmberger-nwfsc @cgrandin https://github.com/cgrandin - please take a read over at some point (at least one of you!). It's in the Data section. [image: image] https://user-images.githubusercontent.com/6977190/73805391-6b010580-477b-11ea-8eda-2f834ae89d58.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pacific-hake/hake-assessment/issues/595?email_source=notifications&email_token=AB46UTPXZEP6ZEFFOEHWWRDRBIPLZA5CNFSM4KHLJETKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKZ5K2I#issuecomment-582210921, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB46UTJG52HJN5NAFALRXNLRBIPLZANCNFSM4KHLJETA .

iantaylor-NOAA commented 4 years ago

Looks good to me too, and I agree with your interpretation of "this parameter", so you can remove the ** phrase in square brackets. If anyone asks, or you could add additional text to say this, the 1.813 value was determined by taking the standard deviation of the distribution associated with converting a uniform distribution of weights W ~ U(0,1) into the log(theta) scale via the transformation log(theta) ~ log(W / (1 - W) ).

iantaylor-NOAA commented 4 years ago

Closed by accident.

andrew-edwards commented 4 years ago

Thanks Ian. I'll add that text, [ as just `theta = w(1-w) ] and remove the **. Thanks for your extensive investigations - I'll add something extra to the Acknowledgments...

pacific-hake / hake-assessment

Prior on the Dirichlet-Multinomial parameter? #595