pfmc-assessments / nwfscSurvey

Tool to pull and process NWFSC West Coast groundfish survey data for use in PFMC groundfish stock assessments
http://pfmc-assessments.github.io/nwfscSurvey/
10 stars 8 forks source link

align column headers with SS3/r4ss and/or within the package? #164

Open iantaylor-NOAA opened 2 days ago

iantaylor-NOAA commented 2 days ago

@chantelwetzel-noaa, I was working on a scripted workflow yesterday to modify SS3 input files with output from {nwfscSurvey} and found it to be really easy, which is awesome.

However, there were a few differences in the column headers created by get_expanded_comps() that required modification to work (see my in-process code at https://github.com/dgbolser/Resample-survey-data/blob/add-to-assessments/resample_survey_bio_data.R#L54-L56). In particular, the nwfscSurvey uses "partition" instead of "part", "input_n" instead of "Nsamp", and capital letters for the observed data (e.g. "F12" instead of "f12"). It only took two lines of code to resolve that difference, but I'm curious if you think it's worth considering making a change to avoid the difference. Given the slower pace of SS3 releases and the disruption already caused by the previous round of r4ss changes (see https://github.com/PIFSCstockassessments/ss3diags/issues/109), I would rather not make the changes on that side even if the names used here are better. We can also just live with the difference.

The nwfscSurvey::SurveyAgeAtLen.fn() is further away from SS3/r4ss as shown in the table below.

Tagging @kellijohnson-NOAA in case there's any value in thinking about this within her ongoing changes to {PacFIN.Utilities}.

It's also OK to not deal with this right now (or ever) and continue to work around the differences.

SS3/r4ss SurveyAgeAtLen.fn()
year year
month month
fleet Fleet
sex sex
part partition
ageerr ageErr
Lbin_lo LbinLo
Lbin_hi LbinHi
Nsamp nSamps
f1 F1
f2 F2
... ...
m1 F1.1
m2 F2.1
... ...
chantelwetzel-noaa commented 2 days ago

I think switching to capital letters for the sex labeling of the composition bin is fine if we all agree the ideal naming structure should use capital letters. I will note that @kellijohnson-NOAA and I spoke about whether the input sample size column should be named Nsamp or input_n and we both agreed that the latter was the preferred term since it is more clear about what that column represents (input sample size). Similarly we can discuss whether the partition column should be part or partition. Currently I prefer the latter since it is very clear, but I could easily be convinced to move to the short column name.

Once we do decide on the best labeling for column names, we should work to ensure that each package the provides processed composition data are in agreement ({PacFIN.Utilities}, {nwfscSurvey}, {nwfscDiscard}).

iantaylor-NOAA commented 2 days ago

Given that the choice of input_n was deliberate, I would support keeping it in place along with partition (which is clearer than part) and just requiring something like dplyr::rename(part = "partition", Nsamp = "input_n").

I like the plan to ensure data in agreement within and among the three pfmc-assessments packages and am happy to help with that effort.

iantaylor-NOAA commented 2 days ago

I should have added that SS3 doesn't care about column names in input files so it would also be possible to go the other direction and modify the objects created by r4ss::SS_read() via dplyr::rename(partition = "part", input_n = "Nsamp") to achieve alignment and allow the dataframes to be combined.

kellijohnson-NOAA commented 2 days ago

Just let me know what you want them to be and I will accommodate. I hope to have this done by EOD but I am going to take a break for a bit.

iantaylor-NOAA commented 2 days ago

@kellijohnson-NOAA, thanks for being so accommodating and getting the changes done so quickly.

@chantelwetzel-noaa, what do you think of the "compromise" column below that I've just added to the previous below as a potential standard for the pfmc-assessments packages?

The one capital "L" in Lbin_lo stands out to me but I think lbin_lo could be mistaken for Ibin_lo (maybe Lbin_Lo would have been best in hindsight), but it's not worth adding additional differences from SS3.

Regarding the data bins, even if it's female only data, I think having "m" columns for the second group makes more sense than repeating the "f" throughout (or "m" in both blocks for male-only data). For single-sex models with only a single block of data bins, r4ss::SS_readdat_3.30() uses l or a (for length or age bins) instead of f and then m. But since we very rarely use the pfmc-assessment packages for single-sex models I'm not sure it's worth worrying about or aligning the notation in that case.

SS3/r4ss SurveyAgeAtLen.fn() compromise
year year year
month month month
fleet Fleet fleet
sex sex sex
part partition partition
ageerr ageErr ageerr
Lbin_lo LbinLo Lbin_lo
Lbin_hi LbinHi Lbin_hi
Nsamp nSamps input_n
f1 F1 f1
f2 F2 f2
... ... ...
m1 F1.1 m1
m2 F2.1 m2
... ... ...
chantelwetzel-noaa commented 2 days ago

Could compositions for single-sexed models use "u" instead of "l" or "a"? I think this would be more consistent but I don't know if that would create downstream impacts. I will note that {plot_comps} looks for either a f, m, or u before the composition bin number but this function could be easily revised. Otherwise I am okay with your proposed column naming. I agree that capitalizing just Lbin_hi and Lbin_lo does look strange but I understand your concern about it being easily misread if lower case l was used.

iantaylor-NOAA commented 2 days ago

@chantelwetzel-noaa, good idea. Using "u" for single-sex models makes sense to me. Here's the resulting plan, please edit if you see something wrong.

lengths ages single-sex (ages)
year year year
month month month
fleet fleet fleet
sex sex sex
partition partition partition
  ageerr ageerr
  Lbin_lo Lbin_lo
  Lbin_hi Lbin_hi
input_n input_n input_n
f1 f1 u1
f2 f2 u2
... ... ...
m1 m1  
m2 m2  
... ...