nmfs-ost / ss3-source-code

The source code for Stock Synthesis (SS3).
https://nmfs-ost.github.io/ss3-website/
Creative Commons Zero v1.0 Universal

allow separate Dirichlet-multinomial parameter for discard vs retained length comp #93

Closed: k-doering-NOAA closed this issue 1 year ago

k-doering-NOAA commented 3 years ago

Imported from redmine, Issue #83862 Opened by @iantaylor-NOAA on 2020-10-14 Status when imported: New

All the data-weighting options for composition data (Dirichlet-Multinomial parameters, sample size adjustments, and lambdas) apply equally to all partitions within a fleet. Many west coast groundfish stocks have much higher sample sizes of discards from the observer program than from the shoreside sampling of retained fish. This likely gives the discard data greater influence on model results than the retained fish, which may not always be appropriate.

Providing a simple mechanism for separate weights by partition would require some changes to the model input structure.

Alternatively, some code could be used with the existing input structure to implement this, such as adding 1000 to the fleet number in the variance adjustment factors to have it apply to partition 1, and 2000 to apply to partition 2 (although applying an adjustment to data only from partition 0 would perhaps require the confusing use of adding 3000). The table in the data file where the Dirichlet-Multinomial parameters are input always has a single row for each fleet, so this would require some other trick to create separate parameters for each partition within the existing input structure.
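
For illustration only, decoding that convention might look like the sketch below in plain C++ (not SS3 code; the struct and function names are hypothetical):

```cpp
#include <cstdio>

// Hypothetical decoding of the "add 1000 per partition" fleet code:
// a plain fleet number applies to all partitions, 1000+f targets
// partition 1, 2000+f targets partition 2, and 3000+f would have to
// stand in (confusingly) for partition 0.
struct FleetPartition { int fleet; int partition; };  // partition -1 = all

FleetPartition decode_fleet_code(int code) {
  int tag = code / 1000;  // 0 = all partitions; 1, 2, or 3 otherwise
  int fleet = code % 1000;
  int partition = (tag == 0) ? -1 : (tag == 3 ? 0 : tag);
  return {fleet, partition};
}

int main() {
  FleetPartition fp = decode_fleet_code(2003);
  std::printf("fleet %d, partition %d\n", fp.fleet, fp.partition);  // fleet 3, partition 2
  return 0;
}
```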

The assessment team at NWFSC has a strong interest in exploring this possibility, but we could also adjust the input sample sizes as a less elegant approach to testing the impact of separate data-weighting by partition.

k-doering-NOAA commented 3 years ago

comment from @RickMethot on 2020-10-14: I'm sure we can work something out.

  1. I generally much prefer to create a new use within existing inputs, even if that new use involves opening up a child input process.
  2. The biggest problem is that these data-weights are applied at the very last stage, after a total logL for that data type has already been created. So implementing this will require significant modification to how the logL is stored. For example, the length data logL is included in the total by: `obj_fun += column(length_lambda,k_phase)*length_like_tot;`

So implementation would require making the vectors `length_lambda` and `length_like_tot` longer to contain the partition-specific info.
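
A minimal sketch of that idea in plain C++ (std::vector standing in for the ADMB vectors; the stacked fleet-by-partition layout is an assumption): with both vectors lengthened the same way, the total remains a single weighted sum.

```cpp
#include <cstddef>
#include <vector>

// Sketch: length_like_tot and length_lambda both stacked to length
// nfleet * 3 (fleet x partition), so the final accumulation stays a
// single dot product, as in
//   obj_fun += column(length_lambda,k_phase)*length_like_tot;
double weighted_total(const std::vector<double>& length_like_tot,
                      const std::vector<double>& length_lambda) {
  double obj_fun = 0.0;
  for (std::size_t i = 0; i < length_like_tot.size(); ++i)
    obj_fun += length_lambda[i] * length_like_tot[i];
  return obj_fun;
}
```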

k-doering-NOAA commented 3 years ago

comment from @iantaylor-NOAA on 2020-10-14: Rick, Thanks for the quick response. Given that background about the lambdas, I would suggest excluding them from consideration in the initial implementation of this. However, adding this option for both the variance adjustment factors used in iterative tuning for Francis or McAllister-Ianelli weighting and the Dirichlet-Multinomial would allow the different data-weighting methods in common use to be compared without the flexibility of one with regard to partition to impact the choice of method (which continues to be an ongoing debate based on this morning's NWFSC assessment team meeting, hence posting this new issue).

**Modifications via variance adjustment factors**

It looks like the sample size adjustment "mult_by_lencomp_N" within the "Input variance adjustments factors" section is applied in SS_prelim.tpl and would require changing lines like `nsamp_l(f,i)=var_adjust(4,f);` to `nsamp_l(f,i)=var_adjust(4,f,mkt_l(f,i));`, where the matrix of variance adjustments created in SS_readcontrol_3.30.tpl could be changed from `matrix var_adjust(1,7,1,Nfleet)` to `3darray var_adjust(1,7,1,Nfleet,0,2)`. The values could simply be repeated across partitions 0 to 2 unless the user asked for different adjustments by partition.
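
A rough sketch of that lookup in plain C++ (names mirror the TPL, but the container types and indexing are assumptions, not the ADMB declarations):

```cpp
#include <array>
#include <vector>

// var_adjust[factor-1][fleet-1][partition]: factors 1..7, fleets 1..Nfleet,
// partitions 0..2, mirroring the proposed 3darray var_adjust(1,7,1,Nfleet,0,2).
using VarAdjust = std::vector<std::vector<std::array<double, 3>>>;

// Factor 4 = mult_by_lencomp_N: scale the input sample size by the
// fleet x partition adjustment, selected by the observation's market code.
double adjusted_nsamp(const VarAdjust& var_adjust, int f, int mkt, double nsamp_raw) {
  return nsamp_raw * var_adjust[4 - 1][f - 1][mkt];
}
```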

**Modifications to the Dirichlet-Multinomial likelihood**

The sample size adjustment is applied in SS_objfunc.tpl in lines like `if(Comp_Err_L(f)==1) dirichlet_Parm=mfexp(selparm(Comp_Err_Parm_Start+Comp_Err_L2(f)))*nsamp_l(f,i);`, which could be modified to something like `if(Comp_Err_L(f)==1) dirichlet_Parm=mfexp(selparm(Comp_Err_Parm_Start+Comp_Err_L2(f,mkt_l(f,i))))*nsamp_l(f,i);`, where `Comp_Err_L2` would need to be converted from a vector to a matrix and populated appropriately based on whatever input method we come up with.
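
Sketched in plain C++ (with `std::exp` standing in for ADMB's `mfexp`, and the container types assumed), the partition-aware parameter lookup could look like:

```cpp
#include <array>
#include <cmath>
#include <vector>

// Comp_Err_L2 promoted from a per-fleet vector to a fleet x partition
// matrix so each partition can point at its own D-M parameter.
double dirichlet_parm(const std::vector<std::array<int, 3>>& Comp_Err_L2,  // [fleet-1][partition]
                      const std::vector<double>& selparm,
                      int Comp_Err_Parm_Start, int f, int mkt, double nsamp) {
  int idx = Comp_Err_Parm_Start + Comp_Err_L2[f - 1][mkt];
  return std::exp(selparm[idx]) * nsamp;  // dirichlet_Parm = mfexp(parm) * N
}
```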

k-doering-NOAA commented 3 years ago

comment from @iantaylor-NOAA on 2020-10-21: I updated the assessment team at NWFSC and they remain interested in this option. Everyone was OK with the lambdas remaining as currently configured, where they apply equally to all partitions.

Here are some ideas for input methods that would not require any changes to the input unless the new option was requested.

**Modifications via variance adjustment factors**

Using a `10*Factor + partition` formula would allow the existing structure to be applied to any partition (0, 1, or 2): instead of Factor = 4 = mult_by_lencomp_N you could input Factor = 41 = mult_by_lencomp_N for partition 1 (discard), while Factor = 4 would remain equivalent to applying the same multiplier to all partitions within a fleet.
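
Decoding that convention is a one-liner; a sketch in plain C++ (illustrative only, assuming all combined codes are two digits):

```cpp
// 10*Factor + partition: 4 = mult_by_lencomp_N for all partitions,
// 41 = the same factor restricted to partition 1 (discard).
struct FactorPartition { int factor; int partition; };  // partition -1 = all

FactorPartition decode_factor(int code) {
  if (code < 10) return {code, -1};  // classic single-digit input: all partitions
  return {code / 10, code % 10};     // e.g. 41 -> factor 4, partition 1
}
```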

The interest at NWFSC is in weighting length and age comps separately by partition, but it's worth considering eventually (if not at the same time) implementing the change for all variance adjustments except 1=add_to_survey_CV and 2=add_to_discard_stddev, for which there is no separate data by partition:

- 3=add_to_bodywt_CV
- 4=mult_by_lencomp_N
- 5=mult_by_agecomp_N
- 6=mult_by_size-at-age_N
- 7=mult_by_generalized_sizecomp

**Dirichlet-Multinomial likelihood**

The current input depends on the CompError and ParmSelect columns in the table of specifications for length and age comps (it's not yet available for generalized size comps), where that table has a single row per fleet. I can't think of a good way to add an option for multiple rows per fleet without adding a column, and it's only the ParmSelect value that could differ among partitions, not any other value in the table. A better approach might be a code that triggers some kind of additional input. For instance, a 999 value in the ParmSelect column could indicate that the parameters will be specified by partition in a separate 3-column table as below:

#_mintailcomp addtocomp combine_M_F CompressBins CompError ParmSelect minsamplesize   
-1            0.001     0           0            1         999        0.001    #_Fishery        
-1            0.001     0           0            1         3          0.001    #_Survey

# Specification of Dirichlet-Multinomial parameters for each partition for all fleets that have ParmSelect = 999 above:
#_Fleet Part    ParmSelect      
1       2       1       # Fishery retained
1       1       2       # Fishery discards
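
One way the lookup could resolve after reading such a follow-on table, sketched in plain C++ (the containers and names are hypothetical, not SS3 internals):

```cpp
#include <map>
#include <utility>

// Fleet-wide ParmSelect values plus optional per-(fleet, partition)
// overrides read from the 3-column table for fleets flagged with 999.
struct DMSpec {
  std::map<int, int> parm_by_fleet;
  std::map<std::pair<int, int>, int> parm_by_fleet_part;
};

int lookup_parmselect(const DMSpec& s, int fleet, int part) {
  auto it = s.parm_by_fleet_part.find({fleet, part});
  if (it != s.parm_by_fleet_part.end()) return it->second;  // partition-specific value
  return s.parm_by_fleet.at(fleet);                         // fall back to the fleet-wide value
}
```
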
iantaylor-NOAA commented 3 years ago

@Rick-Methot-NOAA, if you're working on the D-M likelihood for generalized size comps in issue #66, perhaps you can keep this issue in mind as well.

This issue is present for Pacific Spiny Dogfish, where Vlada Gertseva is leading a stock assessment due in April 2021. Dogfish are primarily discarded, so the sample sizes for the discards are much larger than for the retained catch. Preliminary models show good fit to discard lengths but bad fit to both the discard ratios and the retained length comps. Flexibility to estimate separate weightings for discard vs. retained comps in the D-M likelihood would be useful to see if that could improve the fit.

The 2011 dogfish assessment used a separate fleet for discard vs. retained fish, but the hope for 2021 is to be able to model the retention process directly, which should allow for a better treatment of uncertainty, among other benefits.

k-doering-NOAA commented 3 years ago

Need to do this without changing the I/O for all users.

Rick-Methot-NOAA commented 3 years ago

Thanks for the suggestions on the input options.

Good news is that the logL is already stored for each observation, and each observation "knows" whether it is discard, retained, or combined. So the only change is where the individual obs get combined into the total logL: `if(header_l(f,i,3)>0) length_like_tot(f)+=length_like(f,i);` so `length_like_tot(f)` needs to get a partition dimension. This will be most easily accomplished by stacking the partition info end-to-end inside longer vectors for `length_like_tot`, with a similar dimension addition for `length_lambda(f,phase)`.

Decision point: either stack fleet inside partition, `f1 = f + part * nfleet`, or stack partition inside fleet, `f1 = part + 1 + (f - 1) * 3`, where fleet goes from 1 to nfleet and part goes from 0 to 2.

Then the total can still use: `obj_fun += column(length_lambda,k_phase)*length_like_tot;`
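
For reference, the two candidate index formulas written out as plain C++ (1-based fleets, partitions 0..2, matching the decision point above):

```cpp
// Option A: partition blocks, each containing all fleets
// (all fleets for part 0, then all fleets for part 1, ...).
int stack_fleet_inside_partition(int f, int part, int nfleet) {
  return f + part * nfleet;
}

// Option B: fleet blocks, each containing partitions 0..2
// (fleet 1's three partitions, then fleet 2's, ...).
int stack_partition_inside_fleet(int f, int part) {
  return part + 1 + (f - 1) * 3;
}
```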

k-doering-NOAA commented 2 years ago

Just in case it got lost in the thread, this is needed for the next assessment cycle for west coast groundfish (work starts fall 2022)

Rick-Methot-NOAA commented 2 years ago

Yes. This and D-M for size comps are competing to be the next big change.

Rick-Methot-NOAA commented 1 year ago

@iantaylor-NOAA @kellijohnson-NOAA @chantelwetzel-noaa I'm looking to start this change, which will be very tricky regarding the internal indexing. Question: I am thinking of implementing this option for the D-M method first and perhaps only. Would that be OK?

iantaylor-NOAA commented 1 year ago

Applying first to D-M makes lots of sense as a pilot application to see how much difference we're seeing in the estimated weights. Whether D-M alone is satisfactory in the long term could depend on how well we can get by with just D-M for those stocks where the weighting is estimated to be quite different.

In the most recent two cycles, the PFMC groundfish assessments have had a mix of D-M and Francis weighting, where the choice between them is typically driven by subjective judgments about the plausibility of the results (e.g. for Big Skate the D-M seemed to work fine, but for Longnose Skate it put too much weight on the indices relative to the comps, causing bad residual patterns and less realistic estimates of growth). However, it would be great if D-M worked for every stock because of the automated aspect, and having separate weighting for discards and retained comps may cause it to perform better and allow us to move more fully in that direction.

I could see it being frustrating to not be able to compare D-M vs Francis with separate weights for discards and retained, but I suppose a hack to facilitate that comparison, if someone thinks it's necessary, would be to use the estimated D-M weights for discards vs retained to adjust the input sample size for the discards and then apply the Francis weighting with that fixed ratio embedded in the inputs. Also, if we're stuck with a single weighting option, that's one fewer challenge for the assessors to deal with. Comparisons of multiple methods are encouraged but not required by the PFMC SSC: "STATs are encouraged to provide a rationale for the method they select and are encouraged to conduct sensitivity runs with the other methods." (accepted practices guide link)

kellijohnson-NOAA commented 1 year ago

I think that just D-M is fine, Rick. It will be a good proof of concept and force us to really figure out why there are differences between D-M and Francis, rather than just defaulting back to what used to be used. I wonder if a little research is needed on input sample sizes so that the composition data do not overwhelm the index data?

Rick-Methot-NOAA commented 1 year ago

Looked at the code and realized that the current indexing scheme allowing D-M parameters to be shared by multiple fleets is already complex. Extending it to handle partition as well is going to be challenging and may need a major overhaul of the I/O. Currently, the code expects this input line for each fleet:

#_mintailcomp addtocomp combM+F CompressBins CompError ParmSelect minsamplesize

An alternative would be to read a list of: fleet, partition, mintailcomp, addtocomp, combM+F, CompressBins, CompError, ParmSelect, minsamplesize, with -9999 to end reading.

The code could invoke this alternative approach by embedding a flag in the current input `3 # length bin method: 1=use databins; 2=generate from binwidth,min,max below; 3=read vector`, with values > 10 indicating use of the alternative input.
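
A sketch of reading that -9999-terminated list in plain C++ (the struct and stream handling are assumptions, not the SS3 reader):

```cpp
#include <istream>
#include <vector>

// One row per (fleet, partition) with the usual composition controls.
struct CompControl {
  int fleet, partition;
  double mintailcomp, addtocomp;
  int combMF, CompressBins, CompError, ParmSelect;
  double minsamplesize;
};

std::vector<CompControl> read_comp_controls(std::istream& in) {
  std::vector<CompControl> rows;
  CompControl r;
  while (in >> r.fleet && r.fleet != -9999) {  // -9999 in the fleet column ends the list
    in >> r.partition >> r.mintailcomp >> r.addtocomp >> r.combMF
       >> r.CompressBins >> r.CompError >> r.ParmSelect >> r.minsamplesize;
    rows.push_back(r);
  }
  // The terminator row carries placeholder values; consume them so the
  // stream is positioned at whatever input section follows.
  double dummy;
  for (int i = 0; i < 8; ++i) in >> dummy;
  return rows;
}
```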

iantaylor-NOAA commented 1 year ago

Hi @Rick-Methot-NOAA, what about the option proposed in the lower part of this comment from 2020 above: https://github.com/nmfs-stock-synthesis/stock-synthesis/issues/93#issuecomment-722536124?

Rick-Methot-NOAA commented 1 year ago

Thought about that, but it did not lend itself to an obvious internal storage solution. Will consider further before proceeding. I think my concern was complications arising from assigning all fleets to a D-M parameter, then going back and adding D-M references by partition for selected fleets.

Also, a proposed business rule:

- a row for partition 0 or 2 must be read first and applies to all partitions;
- then a row for partition 2 can differ from partition 0;
- then a row for partition 1 (discards) can differ.
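
That rule amounts to seed-then-override; a minimal sketch in plain C++ (a single generic value per partition, purely illustrative):

```cpp
#include <array>

// First row read (partition 0 or 2) seeds all three partitions; later
// rows for partition 2 and then partition 1 override selectively.
struct PartitionSetting {
  std::array<double, 3> value{};  // one setting per partition 0..2
  bool seeded = false;

  void read_row(int part, double v) {
    if (!seeded) {
      value.fill(v);    // first row applies to all partitions
      seeded = true;
    } else {
      value[part] = v;  // subsequent rows override a single partition
    }
  }
};
```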

iantaylor-NOAA commented 1 year ago

That makes sense. I haven't thought about internal storage at all. We can adapt to whatever design is easiest to implement.

Rick-Methot-NOAA commented 1 year ago

Not ready to do a pull request yet, but the I/O for the new format is working: `2 # use length composition data (0/1/2)` where 2 invokes the new comp_control format .........

#_Using new list format for composition controls

#_fleet partition mintailcomp addtocomp combM+F CompressBins CompError ParmSelect minsamplesize
0       0         0.0001      4e-07     0       0            0         0          1
1       1         0.0001      2e-07     0       0            0         0          1
1       2         0.0001      3e-07     0       0            0         0          1
-9999   1         0.0001      1e-07     0       0            0         0          1