pbs-assess / gfmp

:zap: Pacific Groundfish Management Procedure Framework

Translating PBS groundfish data to the DLMtool Data class #1

Closed seananderson closed 4 years ago

seananderson commented 6 years ago

Hi @tcarruth, @adrianHordyk, @quang-huynh:

As we've discussed, it would be useful if I wrote a function to translate our PBS DFO groundfish data to a DLMtool data class. I'm trying to wrap my head around what that would look like. Below are all the slots currently in the Data class. I've checked off the slots that I assume we will typically have data for.

  1. Do the checked/empty slots look right?
  2. Do you have answers to any of my questions scattered throughout?
  3. How close is the multi-fleet option? What will the Data format look like? We almost always have multiple survey indices and possibly commercial CPUE indices. Even if it's not ready to roll yet, perhaps it would make sense if I wrote this code with the final format in mind. Otherwise, I'm not sure what we're going to do besides average survey indices, pick indices, or run simulations for individual areas. Although I guess we're going to be forced into that anyways for the data-limited MPs.
  4. What would be the most efficient way for me to do this? Create a new object first with new("Data") and then populate it as needed? Or initialize the object, passing in the data via .Object? Some other way? I've barely programmed with S4 classes before, so this part is mostly new to me.
  5. Are there other obvious slots in other object classes I should be populating from our data?

Data slots:

- Is this useful to include if we have the complete time series of catches?
- From stochastic SRA
- Should Lbar and Lc be populated if we have a mean length time series?
- This differs from L5 from stochastic SRA? Should this come from the commercial length samples?
- From stochastic SRA LFS
- From stochastic SRA? Should this be populated if there is also Dt?
- As a first pass maybe it makes sense just to take a point value, but these could be samples from the posterior in cpars. Same for others below.
- Is this the CV of the von Bertalanffy error?
- Would we take this from stochastic SRA or not include it?
- I'm not sure what this is. Should we include it?
- Presumably from the survey CV. What have you done with commercial CPUE before? I assume make some assumption like we often do in assessments?
- I assume we don't need to include anything for this last set of slots?

AdrianHordyk commented 6 years ago

Some responses in reverse order:

5 - No, all other object classes (e.g. Stock, Fleet, etc.) are part of the MSE object. At some point we might want to look at writing functions that condition an OM on the Data object - similar to how we do SS2OM() etc. now. We've held off doing that so far because, at least in data-limited cases, the MSE often includes more variability/uncertainty than the best estimates in the Data object.

4 - Initializing with Data <- new('Data') and populating with Data@Cat <- matrix(...) etc. is probably easiest. A second option is to write a Data.xlsx or Data.csv file and import it using XL2Data (DataInit creates a blank Data workbook). But this is likely to change as we create the new Data object - see the next point.
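For example, something like this (a minimal sketch with made-up values; the real catch and index series would of course be much longer):

```r
library(DLMtool)

dat <- new("Data")                         # blank Data object
dat@Name <- "Example stock"                # illustrative
dat@Year <- 2010:2017                      # would start at the beginning of the fishery
dat@Cat  <- matrix(c(120, 140, 90, 85, 100, 95, 110, 105), nrow = 1)  # catch; nsim = 1 for real data
dat@Ind  <- matrix(c(NA, 1.2, NA, 0.9, NA, 1.1, NA, 1.0), nrow = 1)   # survey index, NA in off years
```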

3 - A little while off completion still. The question of what the Data format will look like is an important one and something I've been meaning to talk to you about since the meeting last week. We've been talking about designing a new general Data object to incorporate all fishery and survey data. It makes sense for you to develop the function with the final format in mind. Your input on the format of the Data object would be valuable.

The place to start might be to design a multi-tab Excel workbook (we can make it compatible with OpenOffice, Google Sheets, etc. later), with the first tab something like Life History & Biology, and then a series of identical tabs for fleets, e.g. Fleet1, Fleet2, Survey1, etc., that contain slots for fishery data (e.g. CAA).

By default the MPs will use fishery data from Fleet1, but they can be designed to use any or all of the fleet/survey data as required for the situation.

From there we can revise the S4 Data object class and write new import functions as well as DFO-Data functions. It won't be necessary to use the Excel workbook approach to import DFO data, but I find the flat-file approach is useful for designing the S4 object. What do you think of this?

2 - Some answers to questions throughout:

1 - More or less. Except for the miscellaneous slots, all data slots should be populated if data/estimates are available. Revising the data object may mean some slots are not necessary.

seananderson commented 6 years ago

Thanks @AdrianHordyk — that's very helpful!

> Note that nsim is always 1 for real fishery data. So anything that says 'vector nsim long' is a point value. nsim > 1 is only in MSE.

Got it. What about for the arrays? "Array of dimensions nsim x nyears x MaxAge.": should that be array(dim = c(1, nsim, nyears)) or matrix(nrow = nsim, ncol = nyears)?

> The question on what the Data format will look like is an important one and something I've been meaning to talk to you about this since the meeting last week. We've been talking about designing a new general Data object to incorporate all fishery and survey data.

I think there's an international group trying to come up with some common data format for input data into stock assessments. Perhaps you guys already know about that. It would seem like you've got a pretty good system working here already. I would think the most obvious thing to do would be to let, for example, Ind be a list of matrices (or make it an array). A list would be fairly flexible and would let the named elements match up across slots, e.g. between Ind and CV_Ind, and not require all elements to have the same number of years. I guess an array with NAs could do the same thing since all the contents would be numeric. I just tend to think of things in nested lists. Either of those options would be easy to populate on our end, so whatever's easiest on the backend for calculations is fine. I'm less tuned into what format would make sense for a spreadsheet since for our purposes we'd be going straight from a SQL database into the S4 object.
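For example, a rough sketch of what I'm imagining (survey names and values are made up, and I realize the current Data class doesn't accept lists in these slots):

```r
# One element per survey/fleet; names line up across slots, and each matrix
# can cover a different span of years.
ind_list <- list(
  Survey1 = matrix(c(1.2, NA, 0.9, NA, 1.1), nrow = 1),
  Survey2 = matrix(c(0.8, 1.0, 1.3), nrow = 1)
)
cv_ind_list <- list(
  Survey1 = matrix(c(0.2, NA, 0.25, NA, 0.22), nrow = 1),
  Survey2 = matrix(c(0.3, 0.28, 0.31), nrow = 1)
)
```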

> Are you referring to the stochastic SRA in DLMtool? Currently SRA is only used to populate the OM, e.g. OM <- StochasticSRA(...). We could write an SRA function that writes the Data slots.

Yes. For any of the data-moderate/limited cases we aren't going to have estimates of depletion etc. And even in the data rich cases we don't have the output from stock assessments in our databases (it would be nice if we did!), so we'd have to manually populate those anyways. So, I assume we want to fill in the fields that StochasticSRA() will use as input but leave things like depletion empty until optionally manually filling those in.

> vbK etc. are point estimates with corresponding CVs (e.g. CV_vbK)

OK, missed those the first time. What exactly is LenCV? CV of the error around the von Bertalanffy model fit?

AdrianHordyk commented 6 years ago

Yes, arrays will still need to have the nsim dimension. So "Array of dimensions nsim x nyears x MaxAge" should be array(dim = c(1, nyears, maxage)), not array(dim = c(1, nsim, nyears)) (that may have been a typo above). When MPs are applied to real data, they always index sim = 1.
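So for real data, something along these lines (dimensions made up):

```r
nyears <- 62                                        # e.g. length(1956:2017)
maxage <- 60                                        # maximum observed age
caa <- array(NA_real_, dim = c(1, nyears, maxage))  # nsim x nyears x MaxAge, with nsim = 1
# fill caa[1, , ] with numbers-at-age by year, then assign it to the Data@CAA slot
```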

I like the idea of named list for data from each fleet. For now it might make sense for you to populate a separate Data object for each fleet, and then we write a function that combines these objects into a new multi-fleet Data object. There will be some redundancy in life-history parameters like M etc, but that shouldn't be a big deal to sort out later.

Yes, agree that it makes sense to fill only the slots with fishery data and leave other unknown values to be populated manually if possible.

In the MSE we simulate length comps by assuming length-at-age is normally distributed around the mean described by the vB curve (truncated at 2 SD). LenCV is the CV of length-at-age and is used to generate the length distribution - i.e. to add variability to length-at-age. I expect it is different to the estimated CV of, say, Linf when fitting the vB model. LenCV would only be used by MPs that generate length composition rather than work with mean length-at-age. I don't think there are any that use it at the moment. I'd say leave it blank or we can auto-populate it with an assumed value of 0.1 or something.
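Roughly like this (the clamping here is only an approximation of the truncation in the MSE, and all values are made up):

```r
vb_len <- function(age, Linf = 100, K = 0.2, t0 = -0.5) {
  Linf * (1 - exp(-K * (age - t0)))      # von Bertalanffy mean length-at-age
}
LenCV <- 0.1
mean_len <- vb_len(10)                   # mean length at age 10
sd_len <- LenCV * mean_len
lens <- rnorm(1000, mean_len, sd_len)    # normally distributed length-at-age
lens <- pmin(pmax(lens, mean_len - 2 * sd_len), mean_len + 2 * sd_len)  # truncate at 2 SD
```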

seananderson commented 6 years ago

OK, I've made some progress with a function to translate our data to the DLMtool data class. https://github.com/pbs-assess/gfplot/blob/master/R/pbs2mse.R

Thanks for the help in this thread and thanks for exporting the classes. There's no rush to update the CRAN DLMtool for any of our needs since we can depend on the GitHub version.

Some questions:

  1. When we're dealing with biological samples, I've only been including females, since I assume most of these management procedures do not deal with males and females. Catch and the survey index of abundance, however, include everything. Does that make sense? Should we be combining males and females elsewhere?

  2. Am I correct that the catch-at-age and catch-at-length arrays represent numbers of fish and not weight of catch?

  3. I assume we should populate the years based on the oldest data set. This will usually be the catch data, which often goes back into the 50s. The survey data will often only go back to around 2000. This means there are a lot of NAs padding the beginning of the time series and CAA/CAL matrices. I assume that's the intention and should be fine for some management procedures?

  4. Since existing data limited management procedures only deal with a single index of abundance, I assume we can either (1) generate a composite index, (2) run individual MSEs for regions of the coast with individual indices of abundance (selecting the most representative index for that region... possibly with a sensitivity analysis if there are multiple options), or (3) get into the business of creating new management procedures. I'm thinking that option 2 probably makes most sense at this point. A composite index would be particularly tricky given that surveys for different parts of the coast operate in alternating years. Does that make sense?

  5. Is only having survey index values every second year going to rule out many MPs?

  6. I've entered LenCV based on the residual error standard deviation from the lognormal error distribution for the vB model fit. I think that gets at the variability in length at age. Yes? As an example, for Pacific ocean perch that's about 0.06.

  7. For the mean-length time series and the CAA and CAL, presumably the MPs are assuming these are from the commercial fleets. For commercial samples, we have samples from before any discards (unsorted) or samples from both before and after any sorting. I assume we would ideally work with only the unsorted samples. Unfortunately, there are usually far fewer of these samples, but I assume the length truncation from the sorted samples would violate assumptions of the MPs, right?

  8. Should MaxAge represent the highest observed age in the CAA matrix or some known maximum age based on the literature or other data? I assume if MaxAge is larger than the observed maximum age in the particular data set, then the CAA should be extended with NAs and 0s? Similarly, should the CAL matrix extend beyond the observed maximum length?

  9. I guess part of my uncertainty about whether or not to extend these CAA and CAL matrices stems from not fully understanding how this data class object will be used in populating the OM. I get that we can use this object to apply the MPs to the real data, but I assume we also want to use some of this to inform the OM and as input for things like the stochastic SRA function. Maybe after filling the data object we can use that to populate the relevant OM slots?

  10. For the CAA and CAL matrices, I've filled in 0s for any unobserved ages or lengths in years that have at least some samples. For any years without any samples I've filled in NAs. Sound right?

  11. For the various @CV_* slots that represent CVs on a parameter, such as the ones in the length-weight model or the vB model, I've entered these as the standard error of the parameter divided by the mean of the parameter (unless the parameter is already on a log scale); see the sketch below. Does this sound right? Generally these CVs are going to be tiny because there are usually a lot of data going into the various growth and maturity models.
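For point 11, for example (numbers made up purely to illustrate the calculation):

```r
k_hat <- 0.21            # estimated von Bertalanffy K
k_se  <- 0.012           # standard error of K from the model fit
cv_vbK <- k_se / k_hat   # ~0.057; this is what I'd put in CV_vbK
```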

Thanks for any help on this!

AdrianHordyk commented 6 years ago

Some responses to your questions. There are a couple of things we should flag and look at in more detail once you have a Data object or two to test.

  1. This sounds right. At the moment the Data object doesn't have sex-specific biological parameters, and the MPs that use life-history info don't separate by sex. Catch and indices should include both sexes, though I can imagine that custom MPs that use sex-specific catch data could be built - e.g. MPs that use length or age data from the female population. Perhaps add an argument to your function for catch etc. data by sex, e.g. sex = c('all', 'Fonly', 'Monly')?

  2. Yes, CAA and CAL are numbers of fish.

  3. Data@Year really should be years since the beginning of exploitation - so Data@Cat represents catch data (when available) from the beginning of the fishery. So I think the way you are doing it is correct - a lot of NAs at the beginning is fine.

  4. I guess it depends on how the MPs will be used in terms of management advice. Is management advice being provided for individual regions on the coast or as a coast-wide TAC? In the former case, both the MSE and the Data object should be populated with regional data.

  5. Shouldn't be a problem, but we need to look at this in more detail. Some MPs interpolate missing data, but in the MSE data is (almost) never missing, so we should test some MPs with your biennial survey index to make sure they are behaving as expected. We may need to make some MPs more robust.

  6. I expected it would have been a bit higher than 0.06, ~0.1 or so, but this could be correct. To generate size comps, we assume length-at-age is normally distributed but truncated at 2 standard deviations. E.g. with Linf = 100 and LenCV = 0.06, the largest size that could be generated for a fish at maximum age is 100 + 2 * (0.06 * 100) = 112. It's not a critical parameter as most MPs either use expected length-at-age or mean length rather than fitting size comps, but we should check that the size comps generated in the model are similar to the data.

  7. The Data object has slots ML and Lbar, which represent the mean length of all fish in the catch and the mean length of only fully selected fish (above Lc), respectively. We should look at how these data are treated by the MPs, but I expect you're right that the data should be representative of the total removals.

  8. I usually make the CAL matrix extend past Linf + 2 * (Linf * LenCV) just to be sure, but I don't think it matters too much. Note that CAL_bins is the end points of the length bins, so length(CAL_bins) should be ncol(CAL) + 1. MaxAge should be the maximum observed age, and is just used to set up the age vectors - i.e. there is no issue with a higher MaxAge other than extra computation with lots of 0s. CAA should ideally extend out to MaxAge, though I think some MPs auto-fill up to MaxAge with 0s if there is no data there. (See the sketch after this list.)

  9. At the moment the Data object isn't used to populate the OM, at least not automatically. The reason for this is that the OM can be specified to include a broader range of hypotheses than what is in a single Data object - i.e. analysts must intentionally condition an OM on a particular data set if they desire.

  10. NAs should be fine. Again, we should check that the MPs are behaving as expected with these data - e.g. to make sure the MP is using all available data and not just the most recent data since the last NA or something like that.

  11. This sounds OK. The CVs are only used to provide stochastic TAC recommendations (i.e. when reps > 1). Uncertainty in life history parameters is probably less important than other data such as current depletion or catch bias, etc.
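Re point 8, a minimal sketch of how the length bins and the CAL array could line up (all values made up):

```r
Linf <- 100; LenCV <- 0.1
max_len  <- Linf + 2 * (Linf * LenCV)        # 120
CAL_bins <- seq(0, 125, by = 5)              # bin end points, extending past max_len
nbins    <- length(CAL_bins) - 1             # number of length classes
nyears   <- 62
cal <- array(0, dim = c(1, nyears, nbins))   # catch-at-length in numbers; nsim = 1
```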

AdrianHordyk commented 6 years ago

Hi @seananderson, I'm working on updating the Data object in DLMtool, mostly improving documentation, importing, and data plotting functions. These changes shouldn't have any impact on your functions, but I'm considering adding a few more slots that will.

The main slots I'm considering adding to the next release are Data@Species and Data@Location to link a Data object to a specific species and region. Perhaps later we will add additional info which can be used for more data-rich MPs (e.g. Data@sigmaR).

Do you have any other information in your database that you think would be useful to add to the Data object?

seananderson commented 6 years ago

Sorry for the delay on this. Data@Species and Data@Location sound good. The current setup covers most of what we have with the exception of usually having multiple survey indices and multiple CPUE indices.

The main data type not covered right now is spatial data. I assume there aren't any included MPs that would use such data, but foreseeably there could be in the future. It might also be nice as a standard format for transferring data. As an example, Tom was asking for spatial CPUE data recently. We have spatial biomass density from the surveys and spatial CPUE. I'm not sure what format would be easiest internally, but it would need a time element with x and y (in lat/long or UTM?) and a value (spatial point data). Not a big deal if this doesn't fit with DLMtool at this point.
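Something like this, perhaps (the column names are just a guess at a convenient layout, not an existing DLMtool structure):

```r
spatial_cpue <- data.frame(
  year  = c(2015, 2015, 2016),
  x     = c(-130.1, -129.8, -130.0),   # longitude (or UTM easting)
  y     = c(51.2, 51.5, 51.3),         # latitude (or UTM northing)
  value = c(3.2, 0.8, 1.5)             # CPUE or biomass density
)
```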

If we're also using this as a common data storage format for use in populating OMs via extra code, then maybe an effort time series?

AdrianHordyk commented 6 years ago

The Data object in the dev package has been updated with slots Data@Common_Name, Data@Species, Data@Region, and, for data-rich MPs, Data@sigmaR.

Spatial data will be dealt with in a future version.