Translating PBS groundfish data to the DLMtool Data class

Hi @tcarruth, @adrianHordyk, @quang-huynh:

As we've discussed, it would be useful if I wrote a function to translate our PBS DFO groundfish data to a DLMtool data class. I'm trying to wrap my head around what that would look like. Below are all the slots currently in the Data class. I've checked off the slots that I assume we will typically have data for.

Do the checked/empty slots look right?
Do you have answers to any of my questions scattered throughout?
How close is the multi-fleet option? What will the Data format look like? We almost always have multiple survey indices and possible commercial CPUE indices. Even if it's not ready to roll yet, perhaps it would make sense if I wrote this code with the final format in mind. Otherwise, I'm not sure what we're going to do besides average survey indices, pick indices, or run simulations for individual areas. Although I guess we're going to be forced into that anyways for the data-limited MPs.
What would be the most efficient way for me to do this? Create an new object first with new("Data") and then populate it as needed? Or initialize the object passing in the data via .Object? Some other way? I've barely programmed with S4 classes before, so this part is mostly new to me.
Are there other obvious slots in other object classes I should be populating from our data?

Data slots:

[x] Name The name of the Data object. Single value. Character string
[x] Year Years that corresponding to catch and relative abundance data. Vector nyears long. Positive integer
[x] Cat Total annual catches. Matrix of nsim rows and nyears columns. Non-negative real numbers
[x] Ind Relative abundance index. Matrix of nsim rows and nyears columns. Non-negative real numbers
[ ] Rec Recent recruitment strength. Vector of length nsim. Positive real numbers
[x] t The number of years corresponding to AvC and Dt. Single value. Positive integer
[x] AvC Average catch over time t. Vector nsim long. Positive real numbers

Is this useful to include if we have the complete time series of catches?

[ ] Dt Depletion over time t SSB(now)/SSB(now-t+1). Vector nsim long. Fraction

From stochastic SRA

[ ] Mort Natural mortality rate. Vector nsim long. Positive real numbers
[ ] FMSY_M An assumed ratio of FMSY to M. Vector nsim long. Positive real numbers
[ ] BMSY_B0 The most productive stock size relative to unfished. Vector nsim long. Fraction
[x] L50 Length at 50 percent maturity. Vector nsim long. Positive real numbers
[x] L95 Length at 95 percent maturity. Vector nsim long. Positive real numbers
[x] ML Mean length time series. Matrix of nsim rows and nyears columns. Non-negative real numbers
[x] Lbar Mean length of catches over Lc. Matrix of nsim rows and nyears columns. Positive real numbers
[x] Lc Modal length of catches. Matrix of nsim rows and nyears columns. Positive real numbers

Should Lbar and Lc be populated if we have a mean length time series?

[x] LFC Length at first capture. Matrix of nsim rows and nyears columns. Positive real numbers

This differs from L5 from stochastic SRA? Should this come from the commercial length samples?

[x] LFS Shortest length at full selection. Matrix of nsim rows and nyears columns. Positive real numbers

From stochastic SRA LFS

[x] CAA Catch at Age data. Array of dimensions nsim x nyears x MaxAge. Non-negative integers
[ ] Dep Stock depletion SSB(current)/SSB(unfished). Vector nsim long. Fraction.

From stochastic SRA? Should this be populated if there is also Dt?

[ ] Abun An estimate of absolute current vulnerable abundance. Vector nsim long. Positive real numbers
[ ] SpAbun An estimate of absolute current spawning stock abundance. Vector nsim long. Positive real numbers
[x] vbK The von Bertalanffy growth coefficient K. Vector nsim long. Positive real numbers

As a first pass maybe it makes sense just to take a point value, but these could be samples from the posterior in cpars. Same for others below.

[x] vbLinf Maximum length. Vector nsim long. Positive real numbers
[x] vbt0 Theoretical age at length zero. Vector nsim long. Non-positive real numbers
[x] LenCV Coefficient of variation of length-at-age (assumed constant for all age classes). Vector nsim long. Positive real numbers

Is this the CV of the von Bertalanffy error?

[x] wla Weight-Length parameter alpha. Vector nsim long. Positive real numbers
[x] wlb Weight-Length parameter beta. Vector nsim long. Positive real numbers
[ ] steep Steepness of stock-recruitment relationship. Vector nsim long. Value in the range of one-fifth to 1
[x] CV_Cat Coefficient of variation in annual catches. Vector nsim long. Positive real numbers
[ ] CV_Dt Coefficient of variation in depletion over time t. Vector nsim long. Positive real numbers

Would we take this from stochastic SRA or not include it?

[ ] CV_AvC Coefficient of variation in average catches over time t. Vector nsim long. Positive real numbers

I'm not sure what this is. Should we include it?

[x] CV_Ind Coefficient of variation in the relative abundance index. Vector nsim long. Positive real numbers

Presumably from the survey CV. What have you done with commercial CPUE before? I assume make some assumption like we often do in assessments?

[ ] CV_Mort Coefficient of variation in natural mortality rate. Vector nsim long. Positive real numbers
[ ] CV_FMSY_M Coefficient of variation in the ratio in FMSY/M. Vector nsim long. Positive real numbers
[ ] CV_BMSY_B0 Coefficient of variation in the position of the most productive stock size relative to unfished. Vector nsim long. Positive real numbers
[ ] CV_Dep Coefficient of variation in current stock depletion. Vector nsim long. Positive real numbers
[ ] CV_Abun Coefficient of variation in estimate of absolute current stock size. Vector nsim long. Positive real numbers
[x] CV_vbK Coefficient of variation in the von Bertalanffy K parameter. Vector nsim long. Positive real numbers
[x] CV_vbLinf Coefficient of variation in maximum length. Vector nsim long. Positive real numbers
[x] CV_vbt0 Coefficient of variation in age at length zero. Vector nsim long. Positive real numbers
[x] CV_L50 Coefficient of variation in length at 50 per cent maturity. Vector nsim long. Positive real numbers
[ ] CV_LFC Coefficient of variation in length at first capture. Vector nsim long. Positive real numbers
[ ] CV_LFS Coefficient of variation in length at full selection. Vector nsim long. Positive real numbers
[x] CV_wla Coefficient of variation in weight-length parameter a. Vector nsim long. Positive real numbers
[x] CV_wlb Coefficient of variation in weight-length parameter b. Vector nsim long. Positive real numbers
[ ] CV_steep Coefficient of variation in steepness. Vector nsim long. Positive real numbers
[ ] sigmaL Assumed observaton error of the length composition data. Vector nsim long. Positive real numbers
[ ] MaxAge Maximum age. Vector nsim long. Positive integer
[x] CAL_bins The values delimiting the length bins for the catch-at-length data. Vector. Non-negative real numbers
[x] CAL Catch-at-length data. An array with dimensions nsim x nyears x length(CAL_bins). Non-negative integers
[ ] TAC The calculated catch limits (function TAC). An array with dimensions PosMPs x replicate TAC samples x nsim. Positive real numbers
[ ] Sense The results of the sensitivity analysis (function Sense). An array with dimensions PosMPs x sensitivity increments. Positive real numbers
[x] Units Units of the catch/absolute abundance estimates. Single value. Character string
[ ] Ref A reference management level (eg a catch limit). Single value. Positive real number
[ ] Ref_type Type of reference management level (eg 2009 catch limit). Single value. Character string
[ ] Log A record of events. Single value. Character string
[ ] params A place to store estimated parameters. An object. R list
[ ] PosMPs The methods that can be applied to these data. Vector. Character strings
[ ] MPs The methods that were applied to these data. Vector. Character strings
[ ] OM A table of operating model conditions. R table object of nsim rows. Real numbers
[ ] Obs A table of observation model conditions. R table object of nsim rows. Real numbers
[ ] Cref Reference or target catch level (eg MSY). Vector of length nsim. Positive real numbers
[ ] Iref Reference or target relative abundance index level (eg BMSY / B0). Vector of length nsim. Positive real numbers
[ ] Bref Reference or target biomass level (eg BMSY). Vector of length nsim. Positive real numbers
[ ] CV_Cref Log-normal CV for reference or target catch level. Vector of length nsim. Positive real numbers
[ ] CV_Iref Log-normalCV for reference or target relative abundance index level. Vector of length nsim. Positive real numbers
[ ] CV_Bref Log-normal CV for reference or target biomass level. Vector of length nsim. Positive real numbers
[ ] CV_Rec Log-normal CV for recent recruitment strength. Vector of length nsim. Positive real numbers
[ ] MPrec The previous recommendation of a management procedure. Vector of length nsim. Positive real numbers
[ ] MPeff The current level of effort. Vector of length nsim. Positive real numbers
[ ] LHYear The last historical year of the simulation (before projection). Single value. Positive integer
[ ] nareas Number of fishing areas. Vector of length nsim. Non-negative integer
[ ] Misc Other information for MPs. An object. R list

I assume we don't need to include anything for this last set of slots?

Some responses in reverse order:

5 - No, all other object classes (e.g Stock, Fleet, etc) are part of the MSE object. At some point we might want to look at writing functions that condition that OM on the Data object - similarly to how we do SS2OM() etc now. We've held of doing that so far, because at least in data-limited cases the MSE often includes more variability/uncertainty than the best estimates in the Data object.

4 - Initialize with Data <- new('Data') and populating by Data@Cat <- matrix(...) etc probably easiest. A second option is to write a Data.xlsx or Data.csv file and import using XL2Data. (DataInit creates a blank Data workbook). But this is likely to change as we create new Data object - see next point

3 - A little while off completion still. The question on what the Data format will look like is an important one and something I've been meaning to talk to you about this since the meeting last week. We've been talking about designing a new general Data object to incorporate all fishery and survey data. It makes sense for you to develop the function with the final format in mind. Your input on the format of the Data object would be valuable.

The place to start might be to design a multi-tab Excel workbook (we can make it compatible with OpenOffice, Google Sheets, etc later) - with the first tab something like: Life History & Biology, and then a series of identical tabs for fleets: e.g, Fleet1, Fleet2, Survey1, etc that contain slots for fishery data (eg CAA etc).

By default the MPs will use fishery data from Fleet1, but can be designed to use any or all of the fleet/survey data as required for situation.

From there we can revise the S4 Data object class and write new import functions as well as DFO-Data functions. It won't be necessary to use the Excel workbook approach to import DFO data, but I find the flat-file approach is useful for designing the S4 object. What do you think of this?

2 - Some answers to questions throughout:

Note that nsim is always 1 for real fishery data. So anything that says 'vector nsim long' is a point value. nsim > 1 is only in MSE.
Are you referring to the stochastic SRA in DLMtool? Currently SRA is only used to populate OM - eg OM <- StochasticSRA(...) . We could write a SRA function that writes the Data slots.
The Dt slot is used by the DCAC family of MPs, where depletion can be specified over t time-period instead of B/B0.
Lbar and Lc are used by some MPs instead of ML. Maybe need to be revised as we revise Data object, but could easily be calculated from CAL data.
LFC - good question - need to check
vbK etc are point estimates with corresponding CVs (eg CV_vbK)
Cref, Iref etc are the target reference values used by MPs that use slope-to-target type rules. Usually need to be manually specified as 'target' index values
A bunch of the last slots (TAC, Log, MPs, etc) don't need to be populated. MPrec and MPeff are the current TAC and effort (e.g last year) and are used by MPs that iteratively adjust TAC or effort.

1 - More or less. Except for the miscellaneous slots, all data slots should be populated if data/estimates are available. Revising the data object may mean some slots are not necessary.

Thanks @AdrianHordyk — that's very helpful!

Note that nsim is always 1 for real fishery data. So anything that says 'vector nsim long' is a point value. nsim > 1 is only in MSE.

Got it. What about for the arrays? "Array of dimensions nsim x nyears x MaxAge.": should that be array(dim = c(1, nsim, nyears)) or matrix(nrow = nsim, ncol = nyears)?

The question on what the Data format will look like is an important one and something I've been meaning to talk to you about this since the meeting last week. We've been talking about designing a new general Data object to incorporate all fishery and survey data.

I think there's an international group trying to come up with some common data format for input data into stock assessments. Perhaps you guys already know about that. It would seem like you've got a pretty good system working here already. I would think the most obvious thing to do would be to let, for example, Ind be a list of matrices (or make it an array). A list would be fairly flexible and would let the named elements match up across slots, e.g. between Ind and CV_Ind, and not require all elements to have the same number of years. I guess an array with NAs could do the same thing since all the contents would be numeric. I just tend to think of things in nested lists. Either of those options would be easy to populate on our end, so whatever's easiest on the backend for calculations is fine. I'm less tuned into what format would make sense for a spreadsheet since for our purposes we'd be going straight from a SQL database into the S4 object.

Are you referring to the stochastic SRA in DLMtool? Currently SRA is only used to populate OM - eg OM <- StochasticSRA(...) . We could write a SRA function that writes the Data slots.

Yes. For any of the data-moderate/limited cases we aren't going to have estimates of depletion etc. And even in the data rich cases we don't have the output from stock assessments in our databases (it would be nice if we did!), so we'd have to manually populate those anyways. So, I assume we want to fill in the fields that StochasticSRA() will use as input but leave things like depletion empty until optionally manually filling those in.

vbK etc are point estimates with corresponding CVs (eg CV_vbK)

OK, missed those the first time. What exactly is LenCV? CV of the error around the von Bertalanffy model fit?

Yes, arrays will still need to have the nsim dimension. So "Array of dimensions nsim x nyears x MaxAge." should be array(dim = c(1, nyears, maxage)), not array(dim = c(1, nsim, nyears)) (that may have been a typo above). When MPs are applied to real data, they always index to sim=1

I like the idea of named list for data from each fleet. For now it might make sense for you to populate a separate Data object for each fleet, and then we write a function that combines these objects into a new multi-fleet Data object. There will be some redundancy in life-history parameters like M etc, but that shouldn't be a big deal to sort out later.

Yes, agree that it makes sense to fill only the slots with fishery data and leave other unknown values to be populated manually if possible.

In the MSE we simulate length comp by assuming length-at-age is normally distributed around mean described by the vB (truncated at 2 SD). LenCV is the CV of length-at-age and used to generate the length distribution - ie add variability to length-at-age. I expect it is different to the estimated CV of say Linf when fitting the vB model. LenCV would only be used by MPs that generate length composition rather than work with mean length-at-age. I don't think there are any that use it at the moment. I'd say leave it blank or we can auto populate it with an assumed value of 0.1 or something.

OK, I've made some progress with a function to translate our data to the DLMtool data class. https://github.com/pbs-assess/gfplot/blob/master/R/pbs2mse.R

Thanks for the help in this thread and thanks for exporting the classes. There's no rush to update the CRAN DLMtool for any of our needs since we can depend on the GitHub version.

Some questions:

When we're dealing with biological samples, I've only been including females, since I assume most of these management procedures do not deal with males and females. Catch and the survey index of abundance, however, include everything. Does that make sense? Should we be combining males and females elsewhere?
Am I correct that the catch-at-age and catch-at-length arrays represent numbers of fish and not weight of catch?
I assume we should populate the years based on the oldest the data set. This will usually be the catch data, which often goes back into the 50s. The survey data will often only go back to around 2000. This means there are a lot of NAs padding the beginning of time series and CAA/CAL matrices. I assume that's the intention and should be fine for some management procedures?
Since existing data limited management procedures only deal with a single index of abundance, I assume we can either (1) generate a composite index, (2) run individual MSEs for regions of the coast with individual indices of abundance (selecting the most representative index for that region... possibly with a sensitivity analysis if there are multiple options), or (3) get into the business of creating new management procedures. I'm thinking that option 2 probably makes most sense at this point. A composite index would be particularly tricky given that surveys for different parts of the coast operate in alternating years. Does that make sense?
Is only having survey index values every second year going to rule out many MPs?
I've entered LenCV based on the residual error standard deviation from the lognormal error distribution for the vB model fit. I think that gets at the variability in length at age. Yes? As an example, for Pacific ocean perch that's about 0.06.
For the mean-length time series and the CAA and CAL, presumably the MPs are assuming these are from the commercial fleets. For commercial samples, we have samples from before any discards (unsorted) or samples from both before and after any sorting. I assume we would ideally work with only the unsorted samples. Unfortunately, there are usually far fewer of these samples, but I assume the length truncation from the sorted samples would violate assumptions of the MPs, right?
Should MaxAge represent the highest observed age in the CAA matrix or some known maximum age based on the literature or other data? I assume if MaxAge is larger than the observed maximum age in the particular data set, then the CAA should be extended with NAs and 0s? Similarly, should the CAL matrix extend beyond the observed maximum length?
I guess part of my uncertainty about whether or not to extend these CAA and CAL matrices stems from not fully understanding how this data class object will be used in populating the OM. I get that we can use this object to apply the MPs to the real data, but I assume we also want to use some of this to inform the OM and as input for things like the stochastic SRA function. Maybe after filling the data object we can use that to populate the relevant OM slots?
For the CAA and CAL matrices, I've filled in 0s for any ages or lengths in years that have some samples. For any years without any samples I've filled in NAs. Sound right?
For the various @CV_* slots that represent CVs on a parameter such as the ones in the length-weight model or the vB model, I've entered these as the standard error of the parameter divided by the mean of the parameter (unless the parameter is already on a log scale). Does this sound right? Generally these CVs are going to be tiny because there are usually a lot of data going into the various growth and maturity models.

Thanks for any help on this!

Some responses to your questions. There are a couple of things we should flag and look at in more detail once you have a Data object or two to test.

This sounds right. At the moment Data object doesn't have sex-specific biological parameters, and the MPs that use life-history info don't separate by sex. Catch and indices should be both, though I can imagine that custom MPs that use sex-specific catch data could be built - eg MPs that use length or age data of female population. Perhaps an argument in your function for catch etc data by sex e.g sex = c('all', 'Fonly', 'Monly')?
Yes, CAA and CAL are numbers
Data@Years really should be years since the beginning of exploitation - so Data@Catch represents catch data (when available) from the beginning of the fishery. So I think the way you are doing is correct - a lot of NAs at the beginning is fine
I guess it depends how the MPs will be used in terms of management advice. Is management advice being provided for individual regions on the coast or a coast wide TAC? In the former, both the MSE and the Data object should be populated on regional data.
Shouldn't be a problem, but need to look at this in more detail. Some MPs interpolate missing data, but in the MSE data is (almost) never missing, so we should test some MPs with your bi-annual survey index to make sure they are behaving as expected. May need to make some MPs more robust.
I expected it would have been a bit higher than 0.06, ~0.1 or so, but this could be correct. To generate size comps, we assume length-at-age is normally distributed but truncated at 2 standard deviations. E.g, with Linf = 100, and LenCV = 0.06, the largest size that could be generated for a fish at maximum age is 100 + 2 (0.06 * 100) = 112. It's not a critical parameter as most MPs either use expected length-at-age or mean length rather than fitting size comps, but we should check that the size comps generated in the model are similar to the data.
The Data object has slots ML and Lbar which represent mean length of all fish in catch and mean length only fully selected fish (above Lc) respectively. We should look at how these data are treated by the MPs, but I expect that you're right in that the data should be representative of the total removals.
I usually make the CAL matrix extend past Linf + 2 (Linf * LenCV) just to be sure, but I don't think it matter too much. Note that CAL_bins is the end points of the length bins, so length(CAL_bins) should be ncol(CAL) + 1. MaxAge should be the maximum observed age, and is just used to set-up the age-vectors - ie there is no issue with higher maxage other that extra computation with lots of 0s. CAA should ideally extend out to MaxAge though I think some MPs auto-fill up to MaxAge with 0s if there is no data there.
At the moment the Data object isn't used to populate the OM, at least not automatically. The reason for this is that the OM can be specified to include a broader range of hypotheses than what is in a single Data object. i.e analysts must intentionally condition an OM on a particular data set if they desire.
NAs should be fine. Again we should check that the MPs are behaving as expected with this data - eg to make sure the MP is using all available data and not just the most recent data since the last NA or something like that.
This sounds ok. The CVs are only used to provide stochastic TAC recommendations (ie when reps > 1). Uncertainty in life history parameters is probably less important than other data such as current depletion or catch bias, etc

Hi @seananderson I'm working on updating the Data object in DLMtool, mostly improving documentation, importing, and data plotting functions. These changes shouldn't have any impact on your functions, but I'm considering adding a few more slots which will.

The main slots I'm considering adding to the next release are Data@Species and Data@Location to link a Data object to a specific species and region. Perhaps later we will be adding additional info which can be used for more data-rich MPs (e.g Data@sigmaR).

Do you have any other information in your database that you think would be useful to add to the Data object?

Sorry for the delay on this. Data@Species and Data@Location sound good. The current setup covers most of what we have with the exception of usually having multiple survey indices and multiple CPUE indices.

The main data type not covered right now is spatial data. I assume there aren't any included MPs that would use the data, but foreseeably in the future there could be. Might also be nice as a standard format for transferring data. As an example, Tom was asking for spatial CPUE data recently. We have spatial biomass density from the surveys and spatial CPUE. I'm not sure what format would be easiest internally, but it would need a time element with x and y (in lat/long or UTM?) and value (spatial point data). Not a big deal if this doesn't fit with DLMtool at this point.

If we're also using this as a common data storage format for use in populating OMs via extra code then maybe an effort time series?

Data object in dev package has been updated with slots: Data@Common_Name, Data@Species, Data@Region, and for data-rich MPs, Data@sigmaR.

Spatial data will be dealt with in a future version

pbs-assess / gfmp

Translating PBS groundfish data to the DLMtool Data class #1

Data slots: