`cdc.csv` not up to date with BC Conservation Data Centre (CDC) fish data

poissonconsulting / fishbc

An R package of Fish Codes for British Columbia

https://poissonconsulting.github.io/fishbc/

Creative Commons Attribution 4.0 International


`cdc.csv` not up to date with BC Conservation Data Centre (CDC) fish data #13

Open lucy-schick opened 5 months ago

lucy-schick commented 5 months ago

The Bug

The data in cdc.csv is not up to date with the BC Conservation Data Centre (CDC).

A Reprex

Using Oncorhynchus nerka as an example, in the CDC database Oncorhynchus nerka and many of its populations have a COSEWIC status (screenshot 1), but this is not present in cdc.csv (screenshot 2).

Other information is not up to date either, including review dates and Provincial statuses, so this makes me think cdc.csv just hasn't been updated in a while (and GitHub says it hasn't been edited in 4 years). Any guidance on an update workflow would be much appreciated. Thanks

Screen Shot 2024-04-07 at 3 10 10 PM

Screen Shot 2024-04-07 at 3 15 27 PM
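The same check can be sketched in R instead of screenshots. This is only an illustration on a toy data frame standing in for cdc.csv: the `Scientific Name` column appears in code later in this thread, but the `COSEWIC` column name and all values here are assumptions.

```r
# Toy stand-in for data-raw/cdc/cdc.csv; values are made up for illustration.
cdc <- data.frame(
  `Scientific Name` = c("Oncorhynchus nerka", "Oncorhynchus mykiss"),
  COSEWIC = c(NA_character_, NA_character_),  # hypothetical column name
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rows for the example species that have no COSEWIC status recorded.
is_nerka <- cdc$`Scientific Name` == "Oncorhynchus nerka"
missing_status <- cdc[is_nerka & is.na(cdc$COSEWIC), , drop = FALSE]

nrow(missing_status)
```

Against the real file, the toy data frame would be replaced with something like `readr::read_csv("data-raw/cdc/cdc.csv")`.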

NewGraphEnvironment commented 5 months ago

This is an extremely useful package that we use all the time. Thank you so much for having it available. We are keen to contribute if we can be helpful.

Exports of both Results and Conservation Status Data look slightly different from cdc.csv (see below), so we are guessing that some wrangling was required originally and/or is now required due to changes in how BC Species & Ecosystems Explorer exports.

Had a look for some wrangling tracking in the repo but did not find it, so either I missed it or perhaps someone talented (i.e. Evan) may have something locally? We can look into putting together a PR for a workflow for updating the current csv using the same column names, with or without some kind of template of past moves to work off of, but figured we should most definitely check in first as there is likely lots we don't know.

It's good to see that BC Species & Ecosystems Explorer has been updating their COSEWIC info and it makes me wonder if there is access to that raw data through an API. I did a quick search of https://search.open.canada.ca/data/ through the front door and the API to no avail, so guessing BC Species & Ecosystems Explorer is the best option...

library("rgovcan")
library("ckanr")
library("tidyverse")

# set up the connection to the data portal
ckanr_setup(url = "https://open.canada.ca/data/en")

govcan_search(keywords = c("COSEWIC"), records = 100, format_results = TRUE) %>% 
  pull(resources) %>% 
  bind_rows()

exported and converted to csv

joethorley commented 5 months ago

@newgraph-lschick and @NewGraphEnvironment - thanks for the interest in this package.

We are keen to keep it up to date. I'm the maintainer while @evanamiesgalonski is on leave so will make decisions.

I agree with your suggested outline. ie

1) Ideally would pull using API but will settle for manual download if all that is available

2) should wrangle from downloaded file (saved as csv) to format for import as data frame in package in data-raw.R script in data-raw directory so record of changes (I couldn't locate any record of how we made changes previously)

3) as much as possible we want the CDC data as is on the CDC site.

If you are able to do a PR that fits with these requirements that would be fantastic. Let me know if I missed anything. Thanks!

NewGraphEnvironment commented 5 months ago

Had an initial look at what is going on. Not surprisingly there is lots of complexity here related to past/current formats and even within content (species present before but no longer detailed etc.).

Will continue to work on this as time allows and we will likely produce some sort of markdown review to communicate the story of the design decisions that were or will need to be made, and potentially short-term workarounds to fulfill current reporting obligations.

Worth some more effort to determine if indeed using the point and click BC Species & Ecosystems Explorer interface is our only viable (and sad) option since the wrangle is not trivial (as usual).

joethorley commented 5 months ago

Yes an API on the government sites would ensure data use can be automated minimizing errors and ensuring information is up to date....

NewGraphEnvironment commented 4 months ago

looks like api access is not yet possible :<

answers are from contact at BC Conservation Data Centre

lucy-schick commented 1 month ago

Hi @joethorley, I hope summer is going well! I've been taking a crack at getting this sorted and have a couple questions for you:

Do you know where the freshwaterfish.csv data comes from? We need to update this file as well for the tests in data-raw.R to pass. I've looked into the available CDC exports but none seem right.
I've taken two approaches to this issue and would like to know which one suits you guys best:

Option 1:

I updated the cdc.csv file with updated data from the CDC and made it look (almost) identical to the old cdc.csv file. This involved removing lots of columns and reformatting certain columns, which took a bunch of scripting. I explained all these steps in cdc.Rmd, which can be found in my branch updated_data here https://github.com/lucy-schick/fishbc/tree/updated_data. This option is good if you guys care about the data being in the same format as before.

Option 2:

I downloaded the same data from the CDC but just left it raw. This option didn't require any scripting and is fast but does result in having many more columns, which may be more data than you guys want. I'm not too sure. Since the cdc.csv file hasn't been updated in ~4 years and there doesn't seem to be any record of how this file was put together previously, @NewGraphEnvironment and I thought maybe it was the raw download from the CDC. And since then, the CDC has likely added more data (and therefore columns) that can be exported, explaining the differences we see between the old cdc.csv file and what can be exported today. Just a thought. This updated-raw cdc.csv file can be found in my branch data_update_aug_2024 here https://github.com/lucy-schick/fishbc/tree/data_update_aug_2024

The only issue with this option is that the following test in data-raw.R won't pass because the CDCcodes need to be updated in the freshwaterfish.csv file. This can be done by hand (it's only a couple of CDCcodes that need to be updated) but I am trying to limit the amount of hand work required so that this process can be easily reproduced in the future. So if you by chance know where the freshwaterfish.csv data comes from that would be awesome!
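The actual test from data-raw.R is not reproduced in this thread. As a rough sketch of the kind of check involved, the snippet below lists codes that freshwaterfish references but that are absent from the CDC export. The column names `CDCcode` and `Element Code` and all code values are assumptions made up for illustration.

```r
# Toy stand-ins for freshwaterfish.csv and cdc.csv; names and values are
# hypothetical and only illustrate the shape of the check.
freshwaterfish <- data.frame(
  Code = c("SK", "RB", "XX"),
  CDCcode = c("A001", "A002", "A999"),
  stringsAsFactors = FALSE
)
cdc <- data.frame(
  `Element Code` = c("A001", "A002"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Codes referenced by freshwaterfish that do not appear in the CDC export.
unmatched <- setdiff(freshwaterfish$CDCcode, cdc$`Element Code`)
unmatched
```

An empty result would mean every code in freshwaterfish still resolves against the updated cdc data; anything listed needs a hand edit or a lookup.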

Let me know what you guys think and I can make a PR. Thanks!

NewGraphEnvironment commented 1 month ago

just a note to let you know we recognize that the data munge better belongs in data-raw vs R directory and that can be done before a PR.

Option 3 could look a bit like https://github.com/lucy-schick/fishbc/issues/1 if we can't find the export for freshwaterfish.csv and we want to create a xref_sp_element_codes.csv to update freshwaterfish.csv using cdc.csv.
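A minimal sketch of how such a cross-reference file might be applied, assuming it holds one row per changed code. The column names `old_code`, `new_code` and `CDCcode`, and all values, are hypothetical.

```r
# Toy stand-in for freshwaterfish.csv (hypothetical column names and values).
freshwaterfish <- data.frame(
  Code = c("SK", "RB"),
  CDCcode = c("OLD1", "KEEP2"),
  stringsAsFactors = FALSE
)

# Toy stand-in for xref_sp_element_codes.csv: old code -> new code.
xref <- data.frame(
  old_code = "OLD1",
  new_code = "NEW1",
  stringsAsFactors = FALSE
)

# Replace codes that have an entry in the cross-reference; keep the rest.
idx <- match(freshwaterfish$CDCcode, xref$old_code)
freshwaterfish$CDCcode <- ifelse(
  is.na(idx), freshwaterfish$CDCcode, xref$new_code[idx]
)
freshwaterfish
```

Keeping the mapping in a csv means the hand work is limited to that one small file and the update itself stays scripted.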

Absolutely no need to throw any time at this on our behalf as we have a working update in https://github.com/lucy-schick/fishbc/tree/updated_data

joethorley commented 1 month ago

Hi @lucy-schick and @NewGraphEnvironment

The freshwaterfish.csv file was hand punched based on available information.

Will respond to other questions tomorrow!

Thanks

Joe

NewGraphEnvironment commented 1 month ago

The taxonomic info looked familiar and woke me up in the night (ha) so I tested retrieving it from ITIS using the taxize package from some ancient code from a benthic invertebrate project. It looks good for retrieving a big part of the freshwaterfish dataframe. Not sure that taxonomic info is necessary but the ITIS IDs seem like a great way to be consistent with names and script taxonomy. Here is that test code that should run in unaltered repo now that I am actually putting it down. code comes from private repo here https://github.com/NewGraphEnvironment/Sheep/blob/master/R/01b_load_invert.R

library(tidyverse)
library(taxize)

cdc <- readr::read_csv("data-raw/cdc/cdc.csv")
freshwaterfish <- readr::read_csv("data-raw/freshwaterfish/freshwaterfish.csv")

names_resolved <- tibble::as_tibble(taxize::gnr_resolve(unique(cdc$`Scientific Name`), 
                                                        data_source_ids = c(3),
                                                        canonical = TRUE, with_context = TRUE,
                                                        best_match_only = TRUE))

cdc_matched <- dplyr::left_join(
  cdc, 
  names_resolved |> select(user_supplied_name, matched_name2), 
  by = c("Scientific Name" = "user_supplied_name"))

# takes a long time so get ids for just 10 as a test
ids <-  taxize::get_ids(unique(cdc_matched$matched_name2[1:10]),
                        db = 'itis')

ids_out <- as_tibble(ids$itis)

classed_raw <- taxize::classification(unique(ids_out$ids), 
                                      db = 'itis')

clean_classification <- function(x){
  a <- x %>% 
    select(rank, name, id) %>% 
    t() %>% 
    as.data.frame() 
  b <- a %>% 
    set_names(nm = unlist(slice(a,1))) %>% 
    as_tibble(.name_repair = ~ make.names(.x, unique = TRUE)) %>% ##this fixes the names if it is an issue
    slice(-1,-3)
}

classed_df <- classed_raw %>% 
  map_df(clean_classification, .id = 'Name')  ##clean up the results and join dataframes together

setdiff(
  names(freshwaterfish |> janitor::clean_names()), 
  names(classed_df)
  )

#looks like subspecies is not there and that ITIS includes that info in species.  The rest of the columns seem as though they could be derived from either cdc exports, "data-raw/ab/ep-fwmis-fisheries-loadform.xls" or "data-raw/whse_fish_species_cd/whse_fish_species_cd.csv"

joethorley commented 1 month ago

Hi @lucy-schick and @NewGraphEnvironment

At a high level I think the best approach is to save the new cdc download as a file called raw.csv in data-raw/cdc and then use a script called cdc.R in data-raw/cdc to wrangle it into a form that is closer to the current cdc.csv file in data-raw/cdc. It should overwrite the current cdc.csv file. Columns should be renamed to the current name in cdc.csv and organized in the same order and columns formatted to be consistent with the current formatting so that existing code using fishbc isn't broken. Obviously new fish should be added and fish that are no longer in cdc.csv removed and this also goes for particular values in cells. It is also ok to add new columns if they provide useful information. The new columns should be added to the end of the data frame. Finally the rows should be sorted alphabetically by Scientific Name.
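The steps above could be sketched roughly as follows. `wrangle_cdc()` is a hypothetical helper, not existing fishbc code, and the toy column names other than `Scientific Name` are made up. In the real data-raw/cdc/cdc.R the inputs would come from raw.csv and the current cdc.csv rather than inline data frames.

```r
# Hypothetical helper: rename raw columns to the current names, keep the
# current column order, append new columns at the end, sort by Scientific Name.
# `rename` maps raw export names (names) to current cdc.csv names (values).
wrangle_cdc <- function(raw, current_names, rename = character()) {
  hits <- names(raw) %in% names(rename)
  names(raw)[hits] <- unname(rename[names(raw)[hits]])
  # Columns already in cdc.csv come first, in their current order.
  # Note: current columns missing from the raw export are silently dropped.
  old <- intersect(current_names, names(raw))
  new <- setdiff(names(raw), current_names)
  out <- raw[, c(old, new), drop = FALSE]
  out <- out[order(out$`Scientific Name`), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy raw export with one renamed column and one new column.
raw <- data.frame(
  `New Field` = c("x", "y"),
  `Sci Name` = c("Salvelinus malma", "Oncorhynchus nerka"),
  `Status` = c("a", "b"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

cdc_new <- wrangle_cdc(
  raw,
  current_names = c("Scientific Name", "Status"),
  rename = c("Sci Name" = "Scientific Name")
)
cdc_new
```

The script would then finish by overwriting data-raw/cdc/cdc.csv with the result, for example with `readr::write_csv()`.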

I'll review your proposed options for updating freshwaterfish.csv now

Thanks

Joe

joethorley commented 1 month ago

With regard to the freshwaterfish.csv file, the taxize package looks really cool but we understand getting every species correct could be a lot of work. We'll accept a hand-edited version or a version that is created by a script from one or more files. Another approach is to hand edit and have a script to identify discrepancies/missing etc. for review purposes, and a notes column to explain why each discrepancy/missing is ok. The notes column would be stripped out for the final data set.
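As a rough sketch of that last approach, the script below flags species missing from the CDC data, checks which of them lack an explanation, and strips the notes for the final data set. The `ScientificName` and `Notes` column names and all values are assumptions for illustration.

```r
# Toy stand-in for a hand-edited freshwaterfish.csv with a Notes column.
freshwaterfish <- data.frame(
  ScientificName = c("Oncorhynchus nerka", "Exampleus fishus"),
  Notes = c(NA_character_, "Hand entered; not tracked by CDC"),
  stringsAsFactors = FALSE
)
# Toy stand-in for cdc.csv.
cdc <- data.frame(
  `Scientific Name` = c("Oncorhynchus nerka", "Salvelinus malma"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Species in freshwaterfish that are not present in the CDC data.
in_cdc <- freshwaterfish$ScientificName %in% cdc$`Scientific Name`
missing_from_cdc <- freshwaterfish[!in_cdc, , drop = FALSE]

# Discrepancies with no note explaining them; these need review.
no_note <- is.na(missing_from_cdc$Notes) | missing_from_cdc$Notes == ""
unexplained <- missing_from_cdc[no_note, , drop = FALSE]

# The Notes column is stripped out for the final data set.
final <- freshwaterfish[, setdiff(names(freshwaterfish), "Notes"), drop = FALSE]
```

A data-raw script could stop with an error whenever `unexplained` has rows, so that every discrepancy is either fixed or documented before the data is rebuilt.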

Thoughts?

lucy-schick commented 1 month ago

Thanks for the feedback. I will make a PR with the criteria you mentioned above soon.

As for the freshwaterfish file, I think having a notes column to keep track of the discrepancies/missing species is a great idea. I will think about this some more and get back to you.

Thanks