nc-minibbs / mbbs

A repository for the Mini-Bird Breeding Survey data
https://minibbs.us

towards a working import process #15

Closed bsaul closed 2 years ago

bsaul commented 2 years ago

TODO

bsaul commented 2 years ago

@ahhurlbert - I have a draft import process in this MR. For the most part, the scripts I had were moved to R functions (with the exception of the code that scraped the website). Then the import process boils down to:

library(magrittr)
library(mbbs)
mbbs_orange <-
  import_ebird_data('inst/extdata/MyEBirdData_Orange_20210316.csv') %>%
  prepare_mbbs_data(
    # mbbs_site_20190127.rds contains website data for Orange only
    mbbs_site_dt = readRDS("inst/extdata/mbbs_site_20190127.rds")  
  ) %>%
  combine_site_ebird()

That said, there are still a few hard-coded assumptions in the functions in the R/ directory (that I tried to label with TODO or a clear comment). I'm sure I missed a few.

Also, we're hitting issues with misspelled location names. For example, I extract the MBBS county using the regex "[Oo]range|[Cc]hatham|[Dd]urham", but submission S89832222 has "MBBS, Chatman, Route 1-9" as the location name, which the regex does not match. We can either have people clean these up in eBird or play whack-a-mole with various spellings in the code. What are your thoughts?
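For what it's worth, here is a rough sketch of the whack-a-mole option using edit distance instead of an exact regex. The function name guess_mbbs_county and the max_dist threshold are made up for illustration and are not in the package; it only uses base R (utils::adist), and a loose threshold could produce false matches, so treat it as a starting point rather than a fix.

```r
# Hypothetical helper (not part of mbbs): guess the county from a location
# name by finding the word closest to a known county name.
counties <- c("orange", "chatham", "durham")

guess_mbbs_county <- function(location, max_dist = 2) {
  # Break the location name into lower-case alphabetic words
  words <- tolower(unlist(strsplit(location, "[^A-Za-z]+")))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)

  # Edit distance matrix: rows are words, columns are counties
  d <- utils::adist(words, counties)

  # Smallest distance from any word to each county
  best_per_county <- apply(d, 2, min)
  j <- which.min(best_per_county)

  # Assumption: anything further than max_dist edits is not a county name
  if (best_per_county[j] > max_dist) return(NA_character_)
  counties[j]
}

guess_mbbs_county("MBBS, Chatman, Route 1-9")
```

"Chatman" is two substitutions away from "chatham", so it falls within the default threshold here, while unrelated words such as "route" do not.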

I've found a number of duplicate submissions for route 1 across several years; e.g.:

 1 S6659580   2009         1
 2 S6659666   2009         1
 3 S10953436  2012         1
 4 S11003090  2012         1
 5 S14515661  2013         1
 6 S14248552  2013         1
 7 S29713521  2016         1
 8 S30189791  2016         1
 9 S37213429  2017         1
10 S37523618  2017         1

Can you help clean these up? To that end, there is an inst/excluded_submissions.yml file in which to list submissions to exclude. The format should be obvious.
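In case it helps with the cleanup, this is roughly how the duplicates can be flagged: count submissions per year and route and keep the combinations with more than one. The sketch below is base R only and uses the ten submissions listed above; the column names year and route are assumptions about how the data are labeled (sub_id is what the import produces).

```r
# The ten submissions listed above; `year` and `route` are assumed column names
subs <- data.frame(
  sub_id = c("S6659580", "S6659666", "S10953436", "S11003090", "S14515661",
             "S14248552", "S29713521", "S30189791", "S37213429", "S37523618"),
  year   = c(2009, 2009, 2012, 2012, 2013, 2013, 2016, 2016, 2017, 2017),
  route  = 1
)

# Count submissions for each year/route combination
counts <- aggregate(sub_id ~ year + route, data = subs, FUN = length)
names(counts)[names(counts) == "sub_id"] <- "n_submissions"

# Keep combinations that have more than one submission
dup_keys <- counts[counts$n_submissions > 1, ]

# Pull the submission IDs belonging to those combinations for review
dup_subs <- merge(subs, dup_keys, by = c("year", "route"))
dup_subs[order(dup_subs$year), c("sub_id", "year", "route")]
```

Which of each pair to keep still needs a human decision; the code only surfaces the candidates.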

Toy around with the functions and let me know what you think.

bsaul commented 2 years ago

Also, I found a Chatham County submission in the Orange County account.

> import_ebird_data('inst/extdata/MyEBirdData_Orange_20210917.csv') %>% filter(mbbs_county == "chatham") %>% distinct(sub_id)
# A tibble: 1 x 1
  sub_id   
  <chr>    
1 S89966940

Do we want people to clean these up, or do we want to do the consistency checks once all the counties are combined? Currently, prepare_mbbs_data fails if there is more than one unique value for mbbs_county in the eBird data. I think it's cleaner to handle this in eBird and not in the data processing.
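To make the current behavior concrete, the check amounts to something like the following. This is a simplified sketch, not the actual code in prepare_mbbs_data; the function name check_single_county is hypothetical, and the only column it assumes is mbbs_county.

```r
# Hypothetical stand-in for the check described above (not the package code):
# stop if the eBird data contain more than one county.
check_single_county <- function(dt) {
  found <- unique(dt$mbbs_county)
  if (length(found) > 1) {
    stop(
      "Expected a single mbbs_county but found: ",
      paste(found, collapse = ", "),
      call. = FALSE
    )
  }
  # Return the data unchanged so the check can sit in a pipeline
  invisible(dt)
}

# Toy input mimicking the Orange account with one stray Chatham submission
mixed <- data.frame(
  sub_id      = c("S6659580", "S89966940"),
  mbbs_county = c("orange", "chatham")
)

result <- try(check_single_county(mixed), silent = TRUE)
inherits(result, "try-error")
```

Failing early like this keeps the stray submission visible, which fits with fixing it at the source rather than silently filtering it out during processing.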