pfmc-assessments / PacFIN.Utilities

R code to manipulate data from the PacFIN database for assessments
http://pfmc-assessments.github.io/PacFIN.Utilities
Other
7 stars 1 forks source link

NA values messing up calculation of Sum_Sampled_Lbs? #114

Closed iantaylor-NOAA closed 1 year ago

iantaylor-NOAA commented 1 year ago

The CalCOM data for petrale has very few years with zero NA values for Trip_Sampled_Lbs. It seems that if there's even one NA values in the year/stratum then the Sum_Sampled_Lbs equals NA as well, resulting in a 2nd stage expansion of 1.0 for the majority of the samples. This makes me think the na.rm = TRUE in this code isn't working as I would have expected: https://github.com/pfmc-assessments/PacFIN.Utilities/blob/497185807b910cbe6664d3171508e432fe473330/R/getExpansion_2.R#L144-L150

I tested using the reproducible example below, but I don't know enough about stats::ave() to figure out a fix (I can try tomorrow).

# create dummy data frame with 5 years and 2 observations per year
tows <- data.frame(
  fishyr = rep(2001:2005, 2), 
  stratification = "ALL",
  Trip_Sampled_Lbs = 1:10)
# set 1 observation to NA for each of the first 3 years
tows$Trip_Sampled_Lbs[1:3] <- NA

# run code referenced above from `getExpansion_2()`
tows[, "Sum_Sampled_Lbs"] <- stats::ave(
    x = tows$Trip_Sampled_Lbs,
    # ... are levels to aggregate over
    tows[, "fishyr"], tows[, "stratification"],
    FUN = sum, na.rm = TRUE)

this results in a table like the following. The presence of na.rm = TRUE makes me think this isn't the intended behavior.

   fishyr stratification Trip_Sampled_Lbs Sum_Sampled_Lbs
1    2001            ALL               NA              NA
2    2002            ALL               NA              NA
3    2003            ALL               NA              NA
4    2004            ALL                4              13
5    2005            ALL                5              15
6    2001            ALL                6              NA
7    2002            ALL                7              NA
8    2003            ALL                8              NA
9    2004            ALL                9              13
10   2005            ALL               10              15
kellijohnson-NOAA commented 1 year ago

Thank you @iantaylor-NOAA for reporting this 🐛. I did not realize that the ... argument does NOT work to pass through arguments to FUN, instead the ... arguments just work to provide grouping variables. I wrote this code before I knew how to use {dplyr} which can easily accommodate grouping variables and passing through arguments to the desired function. But, in case you are curious, you can take your reproducible example, that was so kindly provided 🙏, and edit the last line to

    FUN = function(x) {sum(x, na.rm = TRUE)})

and it will work

   fishyr stratification Trip_Sampled_Lbs Sum_Sampled_Lbs
1    2001            ALL               NA               6      
2    2002            ALL               NA               7      
3    2003            ALL               NA               8      
4    2004            ALL                4              13      
5    2005            ALL                5              15      
6    2001            ALL                6               6      
7    2002            ALL                7               7      
8    2003            ALL                8               8      
9    2004            ALL                9              13      
10   2005            ALL               10              15

The {dplyr} equivalent is

tows <- tows %>%
  dplyr::group_by(fishyr, stratification) %>%
  dplyr::mutate(new_sum = sum(Trip_Sampled_Lbs, na.rm = TRUE))
testthat::expect_equal(tows$new_sum,tows$Sum_Sampled_Lbs)