function to generate input discard rates for SS

mkapur commented 3 years ago

Hi all, I can't recall if I mentioned this to Ian/Chantel in pre-COVID times, but I have a simple chunk of dplyr code I've been using with the generated DisRatio csvs that come out of this package. The code simply computes the proportional catches and re-weights the discard rates for years 2011+ (I've only tested it for sablefish).

Let me know if it would be worth functionalizing this for this package (not sure if you have group policies about dplyr or not). I would also need to include a formatting step to bind the reweighted values to the earlier years and set it up for SS use. Let me know if anyone on your team has a starting point for that, to avoid redundant effort :)

MK

library(dplyr)

## disrate_cs and disrate_ncs are the boot csvs; disrate_cs only has years 2011+ accordingly

discard_late <- merge(
  ## catch-share retained catch and discard ratio by year and gear
  disrate_cs %>%
    select(ryear, CS_LBS = Observed_RETAINED.MTS,
           CS_RATIO = Observed_Ratio, GEAR = gear2),
  ## non-catch-share retained catch and discard ratio by year and gear
  disrate_ncs %>%
    select(ryear, NCS_LBS = Median.Boot_RETAINED.MTS,
           NCS_RATIO = Observed_Ratio, GEAR = gear2),
  by = c("GEAR", "ryear")) %>%
  mutate(tot = CS_LBS + NCS_LBS) %>%
  group_by(ryear, GEAR) %>%
  ## weight each sector's discard ratio by its share of the combined catch
  summarise(cs_prop = CS_LBS / tot,
            ncs_prop = NCS_LBS / tot,
            cs_propxrate = cs_prop * CS_RATIO,
            ncs_propxrate = ncs_prop * NCS_RATIO,
            total_disrate = cs_propxrate + ncs_propxrate)
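
For intuition, the arithmetic in the chunk above can be checked on a toy case in base R. The numbers below are made up for illustration and are not WCGOP values.

```r
# Hypothetical retained catch (mt) and discard ratios for one year and gear;
# illustrative numbers only, not WCGOP data.
cs_mt     <- 900   # catch-share retained catch
ncs_mt    <- 100   # non-catch-share retained catch
cs_ratio  <- 0.02  # catch-share discard ratio
ncs_ratio <- 0.20  # non-catch-share discard ratio

# Each sector's share of the combined retained catch
cs_prop  <- cs_mt / (cs_mt + ncs_mt)
ncs_prop <- ncs_mt / (cs_mt + ncs_mt)

# Catch-weighted discard rate
total_disrate <- cs_prop * cs_ratio + ncs_prop * ncs_ratio
total_disrate
```

With these inputs the combined rate is 0.038, much closer to the catch-share ratio because that sector accounts for 90% of the retained catch.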

chantelwetzel-noaa commented 3 years ago

@mkapur This type of function may be of interest since folks are often doing this themselves. However, I think individuals may be weighting the rates by total catches, not just the observed totals. I would have to think about this difference and how and when it would matter. We could just create a switch to allow people to weight by the observed totals or their own totals.

There is an additional observation type that we are now providing: EM (electronic monitoring) data, which I believe start around 2015 (these data are part of the catch share program but just have video monitoring rather than a person). Also, I have since updated some of the outputs to all be in metric tons rather than pounds. If that is all the code, I can test this approach out and see what changes need to be made to account for the updated columns and the additional data source, and whether the approach generally makes sense for how we think the data should be treated.
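
As a rough sketch of the switch described above, extended to a third source such as EM, something like the following could work. The function name `combine_rates` and its arguments are hypothetical and not part of nwfscDiscard; the numbers are made up.

```r
# Sketch only: `combine_rates` and its arguments are hypothetical names,
# not part of nwfscDiscard.
combine_rates <- function(rates, observed_mt, total_mt = NULL) {
  # Weight by user-supplied totals when given, otherwise by observed totals
  w <- if (is.null(total_mt)) observed_mt else total_mt
  stopifnot(length(w) == length(rates), all(w >= 0), sum(w) > 0)
  sum(rates * w) / sum(w)
}

# Three sources for one year and fleet (made-up numbers)
rates       <- c(cs = 0.02, ncs = 0.20, em = 0.05)
observed_mt <- c(cs = 900, ncs = 20, em = 80)
total_mt    <- c(cs = 900, ncs = 100, em = 80)

rate_obs   <- combine_rates(rates, observed_mt)            # observed totals
rate_total <- combine_rates(rates, observed_mt, total_mt)  # user-supplied totals
```

The two results differ whenever observer coverage is uneven across sources, which is the case where the choice of weights would matter.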

mkapur commented 3 years ago

Thanks for the quick reply @chantelwetzel-noaa. I did notice that the outputs had changed from LBS to MT since my last go-round (2019 sablefish assessment) -- I actually used the MTs here but just didn't change the output column name. That is a good point about the true total vs observed total, and so long as those were subsettable in the CSV I don't see why it couldn't be an option.

Yes, this is basically "it" for the code. My cursory comparison between the 2019 sablefish benchmark and using this method resulted in fairly similar values -- I'm not sure if these differences would be due to revisions in the catch history (see attached fig). If not, I can dig into why.

Let me know how I can help!

chantelwetzel-noaa commented 3 years ago

Those changes are fairly minor and they may be due to changes in the WCGOP data. The program updates and refines its methods frequently and adjusts historical data based upon the current best practices. If you want to create a new function for this that would be great.

iantaylor-NOAA commented 3 years ago

@mkapur, thanks for offering this up. I don't have anything to add on the process of weighting discard rates, but to answer your question about "group policies about dplyr", I don't think we have any policies. Most of us just learned R long before dplyr was launched and have been slow to adopt it.

Dependency on dplyr was recently added to r4ss as discussed at https://github.com/r4ss/r4ss/pull/455#issuecomment-758315443.

mkapur commented 3 years ago

Great -- good to know about dplyr dependency. Here's a start which spits out the table/text file for SS3, and the user can specify which # of years they'd like to update:

https://github.com/mkapur/kaputils/blob/master/R/reweight_discards.R
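
This is not the linked function, but a minimal base R sketch of what the bind-and-format step could look like. `format_ss_discard` is a hypothetical helper, the output column names (Yr, Seas, Flt, Discard, Std_in) are assumed from the discard block of an SS3 data file and should be checked against your SS3/r4ss version, and `std_in` is a user-supplied placeholder rather than an estimated uncertainty.

```r
# Sketch only: `format_ss_discard` is a hypothetical helper, not the linked
# function. Output columns are assumed to follow the discard block of an
# SS3 data file (year, season, fleet, observation, standard error).
format_ss_discard <- function(early, late, fleet, std_in, first_late_yr = 2011) {
  # early, late: data frames with columns `ryear` and `rate`
  early <- early[early$ryear < first_late_yr, c("ryear", "rate")]
  late  <- late[late$ryear >= first_late_yr, c("ryear", "rate")]
  both  <- rbind(early, late)
  both  <- both[order(both$ryear), ]
  data.frame(Yr = both$ryear, Seas = 1, Flt = fleet,
             Discard = both$rate, Std_in = std_in)
}

# Made-up rates for illustration
early <- data.frame(ryear = 2009:2010, rate = c(0.15, 0.18))
late  <- data.frame(ryear = 2011:2012, rate = c(0.04, 0.03))
ss_tab <- format_ss_discard(early, late, fleet = 1, std_in = 0.1)

# Write a whitespace-delimited table that can be pasted into the data file
out_file <- tempfile(fileext = ".txt")
write.table(ss_tab, file = out_file, quote = FALSE, row.names = FALSE)
```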

@chantelwetzel-noaa to confirm, when folks weight by total catches, I'm assuming this involves pulling in a second dataset. If there is a standard format for how this is getting done let me know.

chantelwetzel-noaa commented 3 years ago

I ran your function yesterday on data for Dover sole and realized a few things. First, this function will need the ability to create weighted discard rates not only by catch share vs. non-catch share, but also by gear type (e.g. fixed gear vs. trawl gear) and by area (e.g. north vs. south or by state). This could be done via the fleet input if we instruct people to concatenate the levels and pass this column into the function in the "fleet" input (e.g. CA_trawl, CA_fixed, OR_WA_trawl, OR_WA_fixed). Alternatively, we could add additional functionality for multiple levels that could be left NA and skipped if they are not used.
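
A minimal sketch of the concatenation idea in base R, assuming hypothetical column names `state` and `gear` in the user's data:

```r
# Made-up strata; the column names `state` and `gear` are assumptions.
strata <- data.frame(
  state = c("CA", "CA", "OR_WA", "OR_WA"),
  gear  = c("trawl", "fixed", "trawl", "fixed")
)

# Concatenate the levels into a single grouping column to pass as "fleet"
strata$fleet <- paste(strata$state, strata$gear, sep = "_")
strata$fleet
```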

The second item that became clear to me is how and when you would want to use this function. This function creates weighted discard rates between catch share and non-catch share, in the current form, for a gear type (I added "state" to the original code to do this by area). Fleet groupings by assessment can vary quite a bit. I think some of the common groupings are:

1. Coastwide fleet with all gears combined (e.g. Fishery Fleet aka POP 2017).
2. Coastwide fleet with a specified gear grouping (e.g. trawl and fixed gear aka sablefish 2019).
3. Area based fleets with all gears combined (e.g. CA Fishery, OR/WA Fishery aka petrale sole).
4. Area based fleets with specified gear grouping (e.g. CA Trawl, CA Fixed Gear, OR/WA Trawl, OR/WA Fixed Gear).

I think your code works well to create a weighted discard rate for item 2 (and can do 4 with minor modifications). The instance where I was thinking someone might want to weight by total removal based on gears would be reflected in items 1 and 3 above. After thinking about this a bit more, I think this can be accomplished a couple of different ways that would not affect this function, so I think we can ignore that dimension for now.

One other question I have that @iantaylor-NOAA or @kellijohnson-NOAA may have thoughts on is where we should put this type of function. Since it is working with WCGOP discard data it does make some sense to stick it in the nwfscDiscard package; however, team members' only use of this package would be this single function (only the WCGOP discard point people are using anything in this package). Given that, we may want to put it somewhere else to facilitate usage. I am unsure of the "right" location.

iantaylor-NOAA commented 3 years ago

@chantelwetzel-noaa your question about where to put the code raises a larger question about the workflow for our discard data. For assessments that have separate fleets by gear type (options 2 or 4) in your list, I'm guessing that many of the assessors would LOVE to just get a single table with reasonably weighted discard rates and not have to run the code themselves. How much more work would it be for the WCGOP discard point people to add that step instead of providing separate tables for catch-shares, non-catch-shares, and EM discard rates?

chantelwetzel-noaa commented 3 years ago

It may not be much extra work as long as Maia's function has the flexibility to combine rates across areas and/or gears in an easy fashion that is consistent with the coding terminology. I quickly remembered yesterday that I do not have even a remote understanding of what the dplyr approach is doing, so me having to modify this code would be challenging and don't even get me started on "tibble tables"...

With that said, I know people would appreciate a single table but I have some reservations about that. I think it is important for people to have to look at the data based on their stratification and to really think about how best to use that information. If I provided a combined data rate people may just run with it which may or may not be the right approach.

iantaylor-NOAA commented 1 year ago

Since this issue is still open, maybe it's a good place to note some of the diverse approaches to aggregating the discard rates:

In 2021 various teams combined the rates across multiple sectors such as via the function shared by @mkapur above or via this function for lingcod: https://github.com/pfmc-assessments/lingcod/blob/main/R/discard_rates_combined.R
In 2023 we have a different set of issues:
- For some of the rockfish species, @chantelwetzel-noaa suggests (in response to a question from @EJDick-NOAA) that the non-catch-shares observer coverage is an issue and suggests using GEMM data, as noted in this comment: https://github.com/pfmc-assessments/nwfscDiscard/issues/11#issuecomment-1472898558 (and I'm guessing this is the code used for Copper referenced in that comment: https://github.com/pfmc-assessments/copper_rockfish_2023/blob/main/R/pacfin_catch/process_gemm.R)
- For petrale sole, the non-catch-shares catch during the catch-shares era from 2011 onward is so small that we decided to just use the ncs rates for the period up to 2010 and the catch-shares rates for 2011 onward with no averaging.
- Catch associated with electronic monitoring has been low enough that it has not been factored into the 2023 estimates of discarding.
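
For comparison with the weighted approaches, the petrale sole choice in the list above (ncs rates through 2010, catch-shares rates from 2011 onward, no averaging) amounts to a simple splice. A base R sketch with made-up rates and assumed column names `ryear` and `rate`:

```r
# Sketch only: made-up rates; the column names are assumptions.
ncs <- data.frame(ryear = 2008:2013,
                  rate  = c(0.20, 0.22, 0.19, 0.25, 0.30, 0.28))
cs  <- data.frame(ryear = 2011:2013,
                  rate  = c(0.03, 0.02, 0.04))

# Non-catch-share rates through 2010, catch-share rates from 2011, no averaging
spliced <- rbind(ncs[ncs$ryear <= 2010, ], cs[cs$ryear >= 2011, ])
row.names(spliced) <- NULL
spliced
```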

Someday maybe the dust will settle on how best to use these data, but it sounds like in the short term we need to make sure we're looking at the different sources independently and evaluating the needs for each species as a first step, so we are not ready for a function to combine the various sources into a Stock Synthesis input.

Fellow 2023 assessors, please feel free to add to or revise this entry if I'm missing something important.

pfmc-assessments / nwfscDiscard

function to generate input discard rates for SS #1