pbs-assess / gfdata

:floppy_disk: An R package for data extraction of BC groundfish data
https://pbs-assess.github.io/gfdata/
1 stars 0 forks source link

discrepancy in data from get_ll_hook_data() and get_survey_sets() #12

Open jdunic opened 1 year ago

jdunic commented 1 year ago

There seems to be a discrepancy in the data coming from gfdata::get_survey_sets() and gfdata::get_ll_hook_data(). From get_ll_hook_data(), it looks like count_target_species and count_non_target_species are not always correct. And I am unclear of where bait counts come from for these fishing_event_id values that correspond with these incongruent values.

Setup data to combine hook counts (baited, empty, total) with survey data

library(dplyr)

ye_survey <- gfdata::get_survey_sets(species = "yelloweye rockfish", ssid = c(39, 40))

ye_hook_dat <- 
  gfdata::get_ll_hook_data(species = "yelloweye rockfish", ssid = c(39, 40)) |>
    mutate(total_hooks = count_target_species + count_non_target_species +
            count_bait_only + count_empty_hooks - count_bent_broken)

dat <- left_join(ye_hook_dat, ye_survey)

Expected results

1. Number of rows from get_survey_sets() and get_ll_hook_data() should be equal

2. total_hooks should equal hook_count

3. count_target_species should equal catch_count

Current results

  1. More rows are returned from get_ll_hook_data() than get_survey_sets().

From what I can tell, I think that this is because of a difference in checking for usable sets in get_ll_hook_data.

nrow(ye_survey) - nrow(ye_hook_dat)
dat |> filter(is.na(catch_count))

db <- "GFBioSQL"

# check first 5 fishing_event_id values that are extra in ye_hook_dat
use_check <- gfdata::run_sql(db, 
  "SELECT FEC.TRIP_ID, FEC.FISHING_EVENT_ID, FEC.CATCH_ID, 
    C.SPECIES_CODE, C.CATCH_COUNT, 
    LS.LGLSP_HOOK_COUNT, LS.LGLSP_HOOKS_SET_COUNT, LS.LGLSP_HOOKS_LOST_COUNT, LS.USABILITY_CODE,
    U.USABILITY_DESC
   FROM FISHING_EVENT_CATCH FEC
       INNER JOIN CATCH C ON
       FEC.CATCH_ID = C.CATCH_ID
       INNER JOIN LONGLINE_SPECS LS ON
       FEC.FISHING_EVENT_ID = LS.FISHING_EVENT_ID
       INNER JOIN USABILITY U ON
       LS.USABILITY_CODE = U.USABILITY_CODE
       LEFT JOIN HOOK_SPECS HS ON
       FEC.FISHING_EVENT_ID = HS.FISHING_EVENT_ID
   WHERE FEC.FISHING_EVENT_ID IN (309481, 309483, 309514, 309516, 309519, 309521);")

# Includes usability codes not in (0, 1, 2, 6)
glimpse(use_check)

2. total_hooks does not equal hook_count (from get_survey_sets())in some cases. hook_count looks like the correct value.

3. catch_count which comes from get_survey_sets() looks like the correct value. While catch information from get_ll_hook_data() looks wrong.

Some examples of 2 and 3:

count_diffs <- dat |>
  drop_na(catch_count) |> # ignore the usability code issue
  mutate(c_diff = count_target_species - catch_count, 
         h_diff = total_hooks - hook_count) |>
  select(fishing_event_id, count_non_target_species, count_target_species, 
         count_bait_only, count_empty_hooks, count_bent_broken, total_hooks, 
         fishing_event_id, catch_count, hook_count, 
         c_diff, h_diff) |> # simplify table view
  rename(non_target = 'count_non_target_species', target = 'count_target_species', 
         bait = 'count_bait_only', empty = 'count_empty_hooks', 
         bent = 'count_bent_broken') # simplify table view

count_diffs |>
  arrange(-abs(c_diff), -abs(h_diff)) |> # show big diffs at top of table
  slice(1:5) |>
  View()

catch_dat <- gfdata::run_sql(db, 
"SELECT FEC.TRIP_ID, FEC.FISHING_EVENT_ID, FEC.CATCH_ID, 
    C.SPECIES_CODE, C.CATCH_COUNT, C.CATCH_WEIGHT, 
    S.SPECIES_COMMON_NAME
 FROM FISHING_EVENT_CATCH FEC
 INNER JOIN CATCH C ON
    FEC.CATCH_ID = C.CATCH_ID
 LEFT OUTER JOIN SPECIES S ON
    C.SPECIES_CODE = S.SPECIES_CODE
 WHERE FEC.FISHING_EVENT_ID IN (2193813, 309482, 1722030, 1722028, 4590043)")

View(catch_dat)

4. When confirming that the bait and empty counts are correct, I cannot find the fishing_event_ids from the example above in HOOK_SPECS, so where is the count_bait_only coming from in get_ll_hook_data()? I might just be confused here. How do the other values in the FISHING_EVENT table come into play?

hs <- run_sql(db, 
"SELECT *
 FROM HOOK_SPECS 
 WHERE FISHING_EVENT_ID IN (2193813, 309482, 1722030, 1722028, 4590043);")
hs
andrew-edwards commented 1 year ago

This is NOT for the IPHC survey, correct? (Then, sorry, I'm not sure I can help).

I know I had similar issues for IPHC survey (and maybe for data that was in gfbio, not just in gfiphc). Is this only happening occasionally? I'm guessing not, but sometimes a line breaks or something, and not all the hooks get fished. Someone dropped their recording sheet over the side of the boat once, and it took me half a day to track that problem down.

Aha - total_hooks might be total hooks deployed? And hence always <= hook_count.

jdunic commented 1 year ago

Correct, I only checked this difference for HBLL INS and specifically only compared information using Yelloweye.

Interestingly... total_hooks is NOT always <= hook_count... And the total_hooks calculation subtracts out bent/broken hooks. But I hadn't thought about those other reasons for missing hooks 😆that is helpful.