pfmc-assessments / PacFIN.Utilities

R code to manipulate data from the PacFIN database for assessments
http://pfmc-assessments.github.io/PacFIN.Utilities

get access to cleaned ageing method (aggregating across AGE_METHOD columns)? #95

Closed · iantaylor-NOAA closed this issue 1 year ago

iantaylor-NOAA commented 1 year ago

Is your feature request related to a problem? Please describe.
Running cleanPacFIN(..., keep_age_method = c("B", "S")) calls on getAge() to find all ageing method codes, within any of the columns AGE_METHOD1, AGE_METHOD2, etc., that correspond to the requested methods. However, the resulting table doesn't include information on which samples matched which method.

In the case of Petrale, the assignment of the ageing error matrix in SS3 depends on whether the method was "B" = break and burn or "S" = surface (as well as on the ageing lab and year). To get this information, I first copied from getAge() the mapping of all the codes into "B" and "S", and then applied the following rules (adapted from 2019 code by @chantelwetzel-noaa) to create a single column representing the best ageing method. Here I'm assuming that if any of the columns contains a "B", then the ageing error will be similar to that estimated for break and burn, even if a different read was done via surface.

Mapping of codes from getAge() used for Petrale outside of cleanPacFIN(): https://github.com/pfmc-assessments/PacFIN.Utilities/blob/9618d19f6dc66f99b3ab78b5e23366229aa86f16/R/getAge.R#L113-L125

Further processing of the mapped values

# Assign an ageing method of "B" if any of the 3 methods listed is "B"
# Assign an ageing method of "S" if any of the 3 methods listed is "S" and there is no "B" among them
# Assign NA if neither "B" nor "S" appears in any column
library(dplyr) # needed for %>%; the verbs below are namespaced explicitly
Pdata <- Pdata %>%
  dplyr::mutate(
    agemethod =
      dplyr::case_when(
        AGE_METHOD1 %in% "B" | AGE_METHOD2 %in% "B" | AGE_METHOD3 %in% "B" ~ "B", # if any B
        AGE_METHOD1 %in% "S" | AGE_METHOD2 %in% "S" | AGE_METHOD3 %in% "S" ~ "S", # if any S but no B
        TRUE ~ NA_character_ # neither B nor S in any column; NA_character_ keeps the column type character
      )
  )

Describe the solution you'd like
If Petrale Sole is the only stock where the choice of ageing error matrix depends on the AGE_METHOD values, then no change to {PacFIN.Utilities} is needed and we can continue to do the processing exactly as described above. However, if it would be beneficial for other stocks, perhaps getAge() could be augmented to create the agemethod column (by whatever name) using a more generalized version of the code above that would work regardless of the number of AGE_METHOD columns, for example along the lines of the sketch below.
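
For illustration only, here is one way such a generalization might look. This is a rough sketch, not code that exists in {PacFIN.Utilities}; it assumes the AGE_METHOD columns have already been mapped to "B"/"S" as in getAge(), and agemethod is just the placeholder name used above.

# Sketch of a generalized rule for any number of AGE_METHOD columns:
# "B" if any column is "B", "S" if any column is "S" (and none is "B"),
# NA otherwise.
library(dplyr)

Pdata <- Pdata %>%
  mutate(
    agemethod = case_when(
      if_any(starts_with("AGE_METHOD"), ~ .x %in% "B") ~ "B",
      if_any(starts_with("AGE_METHOD"), ~ .x %in% "S") ~ "S",
      TRUE ~ NA_character_
    )
  )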

Describe alternatives you've considered
No change.

kellijohnson-NOAA commented 1 year ago
@iantaylor-NOAA thanks for the great idea. In theory, PacFIN should supply this information in the column FINAL_FISH_AGE_CODE, but they do not. So, I have created a function (not yet pushed) called getAgeCode() that we can add to cleanPacFIN() to populate this column. I am replying largely to ask a design question. When the ages from two different methods match the best age, e.g.,

age1  age2  age3  FINAL_FISH_AGE_IN_YEARS  Age_Method1  Age_Method2  Age_Method3
   1     1     3                        1  "B"          "S"          NA

what should the returned Age_Method be? I have currently coded it to paste all of the age methods that match the final age, separated by "--". For example, if you run getAgeCode() on the petrale sole biological data you get the following values:

out <- getAgeMethod(bds.pacfin)
table(out, useNA = "always")
out
      1       2       B    B--S      BB BB--UNK       N       S    <NA>
  14567   38600    7224     528    4474       6      43   39283  158378

Note that I should recode these to proper method codes (e.g., 1 should be "B") prior to returning, perhaps along the lines of the sketch below.
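
Purely as a sketch of what that recoding might look like (this is not the actual getAgeCode()/getAgeMethod() implementation, and the helper name is made up), assuming the same groupings used in getAge(), i.e., 1 and BB map to "B", 2 maps to "S", and any pasted combination containing a break-and-burn read is treated as "B":

# Hypothetical helper: collapse the pasted method strings above into "B"/"S"/NA.
# Assumes 1/B/BB indicate break and burn and 2/S indicate surface; any
# combination that includes a break-and-burn read is treated as "B".
recode_method <- function(x) {
  dplyr::case_when(
    grepl("B", x) | x %in% "1" ~ "B",
    grepl("S", x) | x %in% "2" ~ "S",
    TRUE ~ NA_character_
  )
}
table(recode_method(out), useNA = "always")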

iantaylor-NOAA commented 1 year ago

@kellijohnson-NOAA, thanks for working on this. I think your design makes sense for now. We could consider refining it in the future after having a conversation with Patrick during a less busy time.

For instance, if there were a case where the final age matched the surface age and not the break and burn age, as in the fictional table below, I would assume that the agers took an extra look at the otolith to resolve the differences and settled on a value that happened to match the surface read. But if you wanted to know which ageing error matrix to apply, or were filtering out all the surface reads (which we always assume are more uncertain), I think it would be best to treat this sample like a break and burn read even if the value didn't match. This is essentially how the 2019 code from @chantelwetzel-noaa worked, assigning "B" if there was any "B" in the set of ageing methods, regardless of which age matched the final age.

age1  age2  FINAL_FISH_AGE_IN_YEARS  Age_Method1  Age_Method2
   5     7                        7  "B"          "S"

brianlangseth-NOAA commented 1 year ago

I agree with Ian that assigning "B" is preferable. The downside is that any surface read that matches a break and burn won't really register in the surface ageing error matrix, because "B" will be assigned. I wonder if that may artificially inflate the error around "S" reads. I don't really see a way around it though. We could have fish that contribute to both error matrices, but that seems sketchy.

@iantaylor-NOAA What would you propose for situations where two ages are averaged and the methods are "B" and "S"? From above it looks like you are using "B" so long as one read was "B". For canary there were a number of ages without FINAL_FISH_AGE_IN_YEARS for Washington. The individual reads were often far enough apart to make me question using an average, as getAge() sometimes does. For example, I have a fish aged 23 by "S" and 72(!) by "B", with an Age from getAge() of 48. It seems like a lot of error is added if we say this method is "B". Obviously, this is a more extreme case.

Might I suggest that if any ages are averaged across reads with different methods, they be excluded from the calculation of the ageing error matrix, especially if those reads are very different? In these cases, assigning the method "avgBS" could be an easy flag for exclusion. So if the ages are the same, use "B", but if they are averaged across two methods, use something like "avgBS" (see the sketch below).
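
To make the proposed rule concrete, here is a rough sketch; none of this is in {PacFIN.Utilities}, the "avgBS" flag is just the label suggested above, and, as before, it assumes the AGE_METHOD columns have already been mapped to "B"/"S". A further refinement could additionally require the individual ages to differ before flagging.

# Sketch: when the final age is missing and reads exist from both "B" and "S"
# (so getAge() would average across methods), flag the record as "avgBS" so it
# can be excluded easily; otherwise fall back to the any-B-then-B rule above.
library(dplyr)

Pdata <- Pdata %>%
  mutate(
    agemethod = case_when(
      is.na(FINAL_FISH_AGE_IN_YEARS) &
        if_any(starts_with("AGE_METHOD"), ~ .x %in% "B") &
        if_any(starts_with("AGE_METHOD"), ~ .x %in% "S") ~ "avgBS",
      if_any(starts_with("AGE_METHOD"), ~ .x %in% "B") ~ "B",
      if_any(starts_with("AGE_METHOD"), ~ .x %in% "S") ~ "S",
      TRUE ~ NA_character_
    )
  )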

chantelwetzel-noaa commented 1 year ago

I think providing some background information here would be useful. Surface reads, especially for longer-lived fish, are very biased, with the surface read significantly underestimating the true fish age. The last Pacific ocean perch assessment has some information around this issue. In that assessment all surface-read ages were excluded. They could have been kept if the user applied an ageing error that specifically dealt with this, but assigning the correct level of bias may be difficult.

In the case of petrale sole, there is a mixture of surface reads due to several issues. Some of those surface reads are simply due to the traditional surface-reading approach. However, once it became clearer that this method could significantly underestimate fish age, there was a period of time, for petrale sole at least, when otoliths that were clearly from young fish were read via surface reads while any otolith that seemed to be from an older fish was read via break-and-burn. This mixture of reading approaches across time is one of the many reasons why petrale sole probably has the highest number of ageing error vectors.

Finally, there are also some otoliths that were initially read via the surface approach but were later reread using break and burn. In this case we should try to default to the break-and-burn age if it is clearly linked in the data.

iantaylor-NOAA commented 1 year ago

Good points @brianlangseth-NOAA and @chantelwetzel-noaa.

Regarding missing FINAL_FISH_AGE_IN_YEARS, I would always choose any "B" age rather than take an average of "B" and "S".
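
If it helps to make that rule concrete, here is a rough sketch; the column names (age1, age2, age3, AGE_METHOD1-3, and the Age column produced by getAge()) follow the examples in this thread and are assumptions, and this is not how any {PacFIN.Utilities} function currently works.

# Sketch: when FINAL_FISH_AGE_IN_YEARS is missing, use the age from a
# break-and-burn read if one exists instead of the average across methods;
# all other records keep the Age already assigned by getAge().
library(dplyr)

Pdata <- Pdata %>%
  mutate(
    Age = case_when(
      is.na(FINAL_FISH_AGE_IN_YEARS) & AGE_METHOD1 %in% "B" ~ age1,
      is.na(FINAL_FISH_AGE_IN_YEARS) & AGE_METHOD2 %in% "B" ~ age2,
      is.na(FINAL_FISH_AGE_IN_YEARS) & AGE_METHOD3 %in% "B" ~ age3,
      TRUE ~ Age
    )
  )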

Regarding ageing error estimation, at this point we typically request the double read data separately from what gets pulled by PacFIN.Utilities, so I don't think we need to worry yet about how our functions for making comps interact with that estimation process.

As the comment from @chantelwetzel-noaa makes clear, there are humans involved in this process making logical decisions, so if we can take time in the future to better understand those decision-making processes and all the different ways that they lead to the data we see in the database, then our functions and analyses will be better.

We could also do more sensitivities in the future to figure out how much any of this matters once the ages are all added up and put into an assessment model.