ropensci / rebird

Wrapper to the eBird API
https://docs.ropensci.org/rebird
Other
82 stars 17 forks source link

Add ebirdchecklist() #108

Closed RichardLitt closed 6 months ago

RichardLitt commented 8 months ago

Unfortunately deleted #107 by accident misusing force push, while trying to squash. This is another attempt at fixing #93.

slager commented 8 months ago

Thanks for the work on this, @RichardLitt. Yeah I think force pushing can mess up PRs since GitHub may be trying to reference specific previous commits internally in PRs, and force pushing removes those.

If you see R CMD check errors in the CI jobs you can often also troubleshoot these locally with devtools::check()

RichardLitt commented 8 months ago

I think I got it. I didn't know I needed to run devtools::document() as part of the build.

RichardLitt commented 8 months ago

Let me know if anything needs editing.

slager commented 8 months ago

I took this for a spin locally and it's a great start, I have a few suggestions

  1. In the tests, expect_error(ebirdchecklist(invalid_checklist_id)) fails because this does not actually throw an error. Instead, it returns a data.frame containing the error description. There should be some code that checks the JSON to see if it is an error result, and throws an actual error there (e.g. with stop()).
  2. There should be some discussion with @sebpardo on the structure of the result, since the API returned object is rather complex. Currently it's returned from ebirdchecklist() as a nested tibble, where things like obs, obsAux, and subAux are returned nested. Is that what we want and is it consistent with the rest of the package outputs? Should these all be unnested into a wide data.frame? Or should the result be unnested and returned as a list containing a checklist-level table and an observation-level table? The latter might be the most tidy. Additional tests are probably needed to check the output structure once that's decided.
  3. The documentation doesn't include any description of what the function returns. Once item number 2 is addressed, this could be fleshed out, see other functions in the package for examples.
  4. @sebpardo?
RichardLitt commented 8 months ago

Let me know what you think makes sense for the output. Happy to change as needed.

RichardLitt commented 7 months ago

Small note: @sebpardo, I'm waiting on you for thoughts on this before moving forward. Let me know what you think!

sebpardo commented 7 months ago

Hi @RichardLitt, apologies for the delay in looking into this. I just had played around with it locally and it looks great!

I agree with @slager that we need to think about how to return the output of this API call as the call itself returns a complex object. Saying this, I think it can be easily flattened into a data frame even in the special case of a checklist with breeding codes and/or attached media. See the following example:

library(dplyr)
cl <- ebirdchecklist("S101576784")
glimpse(cl)
## Rows: 17
## Columns: 22
## $ projId                      <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD", "E…
## $ subId                       <chr> "S101576784", "S101576784", "S101576784", "S101…
## $ protocolId                  <chr> "P22", "P22", "P22", "P22", "P22", "P22", "P22"…
## $ locId                       <chr> "L17570646", "L17570646", "L17570646", "L175706…
## $ durationHrs                 <dbl> 3.367, 3.367, 3.367, 3.367, 3.367, 3.367, 3.367…
## $ allObsReported              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ creationDt                  <chr> "2022-01-28 10:01", "2022-01-28 10:01", "2022-0…
## $ lastEditedDt                <chr> "2023-03-17 15:49", "2023-03-17 15:49", "2023-0…
## $ obsDt                       <chr> "2022-01-23 17:02", "2022-01-23 17:02", "2022-0…
## $ obsTimeValid                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ checklistId                 <chr> "CL30239", "CL30239", "CL30239", "CL30239", "CL…
## $ numObservers                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ effortDistanceKm            <dbl> 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.…
## $ effortDistanceEnteredUnit   <chr> "km", "km", "km", "km", "km", "km", "km", "km",…
## $ subnational1Code            <chr> "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-VS", "C…
## $ submissionMethodCode        <chr> "EBIRD_android", "EBIRD_android", "EBIRD_androi…
## $ submissionMethodVersion     <chr> "2.8.8_SDK28", "2.8.8_SDK28", "2.8.8_SDK28", "2…
## $ userDisplayName             <chr> "Sebastián Pardo", "Sebastián Pardo", "Sebastiá…
## $ numSpecies                  <int> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
## $ submissionMethodVersionDisp <chr> "2.8.8", "2.8.8", "2.8.8", "2.8.8", "2.8.8", "2…
## $ subAux                      <df[,4]> <data.frame[17 x 4]>
## $ obs                         <df[,13]> <data.frame[17 x 13]>
glimpse(cl$subAux)
## Rows: 17
## Columns: 4
##  $ subId           <chr> "S101576784", "S101576784", "S101576784", "S101576784", "S1…
##  $ fieldName       <chr> "nocturnal", "nocturnal", "nocturnal", "nocturnal", "noctur…
##  $ entryMethodCode <chr> "ebird_nocturnal", "ebird_nocturnal", "ebird_nocturnal", "e…
##  $ auxCode         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
glimpse(cl$obs)
## Rows: 17
## Columns: 13
##  $ speciesCode      <chr> "ameoys", "whimbr", "kelgul", "perboo1", "neocor", "perpel…
##  $ hideFlags        <list> [], [], [], [], [], [], [], [], [], [], [], [], [], [], […
##  $ obsDt            <chr> "2022-01-23 17:02", "2022-01-23 17:02", "2022-01-23 17:02"…
##  $ subnational1Code <chr> "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-…
##  $ howManyAtleast   <int> 2, 1, 20, 50, 3, 25, 2, 2, 1, 1, 2, 2, 2, 2, 1, 3, 4
##  $ howManyAtmost    <int> 2, 1, 20, 50, 3, 25, 2, 2, 1, 1, 2, 2, 2, 2, 1, 3, 4
##  $ subId            <chr> "S101576784", "S101576784", "S101576784", "S101576784", "S…
##  $ projId           <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBI…
##  $ obsId            <chr> "OBS1328796217", "OBS1328796219", "OBS1328796215", "OBS132…
##  $ howManyStr       <chr> "2", "1", "20", "50", "3", "25", "2", "2", "1", "1", "2", …
##  $ present          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
##  $ mediaCounts      <df[,1]> <data.frame[17 x 1]>
##  $ obsAux           <list> <NULL>, <NULL>, <NULL>, [<data.frame[1 x 7]>], <NULL>, <NU…
glimpse(cl$obs$obsAux)
## List of 17
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ :'data.frame':    1 obs. of  7 variables:
##   ..$ subId          : chr "S101576784"
##   ..$ fieldName      : chr "breeding_code"
##   ..$ entryMethodCode: chr "api_breeding_code"
##   ..$ auxCode        : chr "FL"
##   ..$ obsId          : chr "OBS1328796224"
##   ..$ speciesCode    : chr "perboo1"
##   ..$ value          : chr "FL"
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
glimpse(cl$obs$mediaCounts)
## Rows: 17
## Columns: 1
## $ P <int> NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

The nested data frames .$subAux and .$obs can be easily joined to the main data frame as they have the same number of rows (Hopefully this is always the case!). Also note that we can simplify what we return and remove several columns as they are either duplicates or obsolete according to eBird's API documentation on this request:

In the fields for each observation, the following fields are duplicates or obsolete and will be removed at a future date: howManyAtleast, howManyAtmost, hideFlags, projId, subId, subnational1Code and present.

I also think we should ignore the columns beginning with submissionMethod* given they're about the observers' hardware instead of birds, and perhaps everything in .$subAux as well given it's not very informative and potentially wrong (I submitted that checklist today in the day but that output says it's a nocturnal checklist...).

With regards to the breeding code, I think the easiest solution would be to have an argument in the function along the lines of breeding_codes = TRUE, for which extra columns are returned with some the output of .$obs$obsAux, perhaps a column names breeding_code with a value equal to .$obs$obsAux$value? (and returning NAs for the species without that data). We could do the same with .$obs$mediaCounts, although I wouldn't be opposed to ignoring this output.

Haven't looked deep into the tests yet, but I tried entering two checklist identifiers and the function should throw an error with a more informative message:

cl2 <- ebirdchecklist(c("S162668032", "S162668033"))
### Error in parse_url(url) : length(url) == 1 is not TRUE

Lastly, I also agree with @slager in that the documentation needs to provide information on every column being returned.

Despite my long response, I don't think it should be too onerous to sort this out! @RichardLitt, let us know if you'd like any help with these suggestions.

slager commented 7 months ago

Here's an attempt to wrangle the returned object into a clean output data.frame:

cl <- ebirdchecklist("S101576784")

# extract sub df
col_is_df <- vapply(cl, is.data.frame, TRUE)
sub_df <- cl[1, !col_is_df]

# extract subAux df
subAux_df <- cl$subAux[1,]
# seems empty, and names conflict with breeding codes
subAux_df$auxCode <- NULL
subAux_df$entryMethodCode <- NULL

# extract obsAux df
obsAux_list <- cl$obs$obsAux
# find the list entry that contains the data
col_is_df <- vapply(obsAux_list, is.data.frame, TRUE)
obsAux_df <- obsAux_list[[which(col_is_df)]]
# redundant columns from sub_df
obsAux_df$subId <- NULL
obsAux_df$speciesCode <- NULL
# duplicate info with uninformative name
obsAux_df$value <- NULL
# names conflict with sub_df, and not very important
obsAux_df$fieldName <- NULL
obsAux_df$entryMethodCode <- NULL

# extract obs df
obs_df <- cl$obs
obs_df$obsAux <- NULL
# hideFlags might be useful, but its structure is currently undocumented
obs_df$hideFlags <- NULL
# remove redundant sub-level columns already in sub_df
obs_df$subnational1Code <- NULL
obs_df$obsDt <- NULL
obs_df$projId <- NULL
# mediaCounts appears to just be a nested integer vector (?)
obs_df$mediaCounts <- Reduce(c, obs_df$mediaCounts)

# join to get result df
sub_df %>%
  dplyr::left_join(subAux_df, by = 'subId') %>%
  dplyr::left_join(obs_df, by = 'subId') %>%
  dplyr::left_join(obsAux_df, by = 'obsId')

If something like this looks like a good start, I'm happy to put in a PR to ropensci/rebird once Richard's PR gets merged.

Note @RichardLitt that @sebpardo bumped the version again on master, so you'll want to merge recent changes to master back into your development branch.

RichardLitt commented 7 months ago

If something like this looks like a good start, I'm happy to put in a PR to ropensci/rebird once Richard's PR gets merged.

Thanks, @slager. I think you'd still like me to make the above changes that @sebpardo suggested though, correct?

Thanks, @sebpardo, for the long response. One thing:

I also think we should ignore the columns beginning with submissionMethod* given they're about the observers' hardware instead of birds, and perhaps everything in .$subAux as well given it's not very informative and potentially wrong (I submitted that checklist today in the day but that output says it's a nocturnal checklist...).

I'm not sure it's a good practice to remove anything we get from the API in our wrapper. I think we should include it, even if it's potentially wrong, because it's what is gotten from the API and it'd be confusing to users to have that information not be there.

I'm happy to implement these changes, but I won't get to it this week, at least.

slager commented 7 months ago

@RichardLitt Yes, re: implementing the suggestions from @sebpardo

RichardLitt commented 6 months ago

I think I solved the issue with https://github.com/ropensci/rebird/pull/108#issuecomment-1925397146 problem 1. @sebpardo, it wasn't clear to me if you wanted to implement all of the changes mentioned in 2, as you started to in https://github.com/ropensci/rebird/pull/108#issuecomment-1963027124. What do you think?

slager commented 6 months ago

Great! I will try to find some time to take another look at this soon.

slager commented 6 months ago

Thanks much @RichardLitt, merging! Feel free to test and tweak this further, @sebpardo.

sebpardo commented 6 months ago

Thank you, @RichardLitt, looks great!

RichardLitt commented 6 months ago

Thank you, @sebpardo and @slager! I appreciate the extra effort as I waded through this.

We're using this endpoint as eBird admins to find checklists that slip past the normal filters, but which shouldn't be in the database due a variety of errors. Hopefully this addition will help us find those easier. Thank you.