ropensci / software-review

rOpenSci Software Peer Review.

auk: eBird Data Extraction with AWK #136

Closed by mstrimas 6 years ago

mstrimas commented 7 years ago

Summary

Access to the eBird database, consisting of over 400 million observations, is provided via a huge (>150 GB) text file. The auk package extracts records from this file and imports them into R for analysis. Both presence-only and presence/absence data can be generated.

Package: auk
Title: eBird Data Extraction with AWK
Version: 0.0.2.900
Date: 2017-07-05
Authors@R: c(
  person("Matthew", "Strimas-Mackey", email = "mes335@cornell.edu", role = c("aut", "cre")),
  person("Eliot", "Miller", role = "aut"),
  person("Wesley", "Hochachka", role = "aut"),
  person("Cornell Lab of Ornithology", role = "cph")
  )
URL: https://github.com/CornellLabofOrnithology/auk, http://CornellLabofOrnithology.github.io/auk/
BugReports: https://github.com/CornellLabofOrnithology/auk/issues
Description: Extract and process bird sightings records from eBird 
    (<http://ebird.org>), an online tool for recording bird observations. 
    Public access to the full eBird database is via the eBird Basic Dataset 
    (EBD; see <http://ebird.org/ebird/data/download> for access), a downloadable 
    text file. This package is an interface to AWK for extracting data from the 
    EBD based on taxonomic, spatial, or temporal filters, to produce a 
    manageable file size that can be imported into R.
Depends: R (>= 3.1.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
Imports:
    assertthat,
    stringr,
    stringi,
    magrittr,
    countrycode,
    tidyr
RoxygenNote: 6.0.1
Roxygen: list(markdown = TRUE)
Suggests:
    readr,
    data.table,
    knitr,
    rmarkdown,
    testthat,
    covr
VignetteBuilder: knitr

This package falls somewhere at the intersection of data retrieval and extraction. It provides access to the eBird database; however, it does so by processing a text file downloaded from eBird that contains the full database.

Anyone looking to work with eBird data for science or conservation.

rebird provides access to eBird data via the eBird API; however, this only gives access to the last 30 days of data. This package is the only one giving access to the full eBird database.

karthik commented 7 years ago

Hi @mstrimas, Thank you for the submission. Sorry for the delay, but I am doing some initial editorial checks before locating suitable reviewers.

Editor checks:

Here are some notes from a goodpractice::gp() check (Please do run it yourself as you fix these issues or explain why you are unable to fix them).

It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. 80% of code lines are covered by test cases.

    R/auk-clean.r:50:NA
    R/auk-clean.r:51:NA
    R/auk-clean.r:52:NA
    R/auk-clean.r:53:NA
    R/auk-clean.r:54:NA
    ... and 161 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it
    gets invalid quite often. A build date will be added to the package
    when you perform `R CMD build` on it.
  ✖ use '<-' for assignment instead of '='. '<-' is the
    standard, and R users and developers are used it and it is easier
    to read your code for them if you use '<-'.

    R/utils.r:20:13
    R/utils.r:74:39
    R/utils.r:75:36
    R/utils.r:77:15
    R/utils.r:83:39

  ✖ fix this R CMD check WARNING: LaTeX errors when creating
    PDF version. This typically indicates Rd problems.
  ✖ fix this R CMD check ERROR: Re-running with no
    redirection of stdout/stderr. Hmm ... looks like a package You may
    want to clean up by 'rm -rf /tmp/Rtmp6kBa0r/Rd2pdf2a896cf4bd4a'
──────────────────────────────────────────────────────────────────────────────── 
Warning messages:
1: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/R/auk-rollup.r'
2: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/R/read.r'
3: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/tests/testthat/test_auk-rollup.r'
4: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/tests/testthat/test_ebird-species.r'

mstrimas commented 7 years ago

Thanks, @karthik, I just fixed several of these, but the following remain:

  ✖ write unit tests for all functions, and all package code in general. 80% of code lines are covered
    by test cases.

    R/auk-clean.r:50:NA
    R/auk-clean.r:51:NA
    R/auk-clean.r:52:NA
    R/auk-clean.r:53:NA
    R/auk-clean.r:54:NA
    ... and 161 more lines

  ✖ fix this R CMD check WARNING: LaTeX errors when creating PDF version. This typically indicates Rd
    problems.
  ✖ fix this R CMD check ERROR: Re-running with no redirection of stdout/stderr. Hmm ... looks like a
    package You may want to clean up by 'rm -rf
    /var/folders/mg/qh40qmqd7376xn8qxd6hm5lwjyy0h2/T//RtmpsPAVpf/Rd2pdf5dcb4abe733d'

karthik commented 7 years ago

@mstrimas No worries. Thank you for fixing the warnings. Regarding those warnings, have you tried adding a blank line at the end of those files? That should make the warnings go away.
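A minimal sketch of that fix from the shell, assuming the advice refers to the "incomplete final line" warnings in the log above. The helper name `ensure_final_newline` and the demo file are hypothetical; the function appends a terminating newline only to non-empty files that lack one, so it is safe to run repeatedly.

```shell
# Hypothetical helper: append a final newline to each file that lacks one.
ensure_final_newline() {
  for f in "$@"; do
    # tail -c 1 prints the last byte. Command substitution strips a trailing
    # newline, so a non-empty result means the file does not end with one.
    if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
      printf '\n' >> "$f"
    fi
  done
}

# Demo on a throwaway file that ends without a newline.
demo_file=$(mktemp)
printf 'x <- 1' > "$demo_file"
ensure_final_newline "$demo_file"
wc -l < "$demo_file"
```

In a package checkout this could be pointed at the R sources, for example `ensure_final_newline R/*.r tests/testthat/*.r`.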

karthik commented 7 years ago

Reviewer 1 is @aurielfournier Review due: August 23 (Auriel noted that she might need an additional week due to travel)

karthik commented 7 years ago

Reviewer 2 is @emhart Review due: August 27

aurielfournier commented 7 years ago

Estimated hours spent reviewing: 6 (this is my first package review so I spent more time than I suspect I might on future reviews)


Reviewer Comments

auk does a great job of removing much of the pain and frustration of working with raw eBird data, which has been a limiting factor for many who want to take advantage of the vast data resources available through eBird. While there are other eBird packages, this is the only one I am aware of that allows you to work with the raw data downloaded from Cornell, as opposed to working with the summary data that can be gleaned from the eBird website.

The package is solid, though since I am not well versed in AWK by any means I'm unable to comment on those fine details.

The vignette is quite extensive, which is fantastic!

I'm reading this vignette thinking about 'the average ebird data user' who isn't necessarily someone with an extensive R background, so while this is a very detailed vignette, and I think the detail is good and important, it might be better if it were rearranged so that the heavy technical detail is towards the end and the 'how to use this package' material is more up front. Heavy users will keep reading, but less experienced users may get overwhelmed by details that are not essential to using the package.

My biggest suggestion would be to remove unlink() from all the function help examples and the vignette. If the user just runs the whole chunk at the same time, like I did the first several times, then the output file isn't there, since R just created it and then deleted it. I understand why you have it there to avoid having lots of files in your own directory, but I think keeping unlink() there will create more issues than it solves, especially for less experienced R users.

I would encourage you to avoid abbreviations since most people aren't going to read the vignette word for word, and consider not using EBD, and just saying 'basic dataset' or something along those lines instead. It will be much more readable/skim-able this way.

Throughout the vignette and function documentation you use pipes, which is great, I like pipes, but lots of people don't. In some cases because they don't like them and in others because they find them confusing. I think it would be valuable to also include examples of how the functions would be used without pipes in the vignette and in the function specific help files.

Since it is not good practice to write over an object with the same name ebd, I would suggest editing your example to not do this, as it could cause issues for people running the examples piecemeal and not following every step.

Function specific feedback

The function help examples in the different filter functions don't include auk_filter at the end of the pipeline. I think it would make sense to include auk_filter in all the examples since you mention in the function description that you need to include auk_filter to finish the process. That way the example is demonstrating the function within its full context.

Build/Install

I don't check build/installation on things very often. So this is not going to be the high point of my review. devtools::check() returned the following. If I am understanding this correctly there aren't any major issues on my machine.

Updating auk documentation
Loading auk
Setting env vars -----------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building auk ---------------------------------------------
"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD build  \
  "C:\Users\amf698\Documents\R\win-library\3.4\auk"  \
  --no-resave-data --no-manual 

* checking for file 'C:\Users\amf698\Documents\R\win-library\3.4\auk/DESCRIPTION' ... OK
* preparing 'auk':
* checking DESCRIPTION meta-information ... OK
* checking whether 'INDEX' is up-to-date ... NO
* use '--force' to remove the existing 'INDEX'
* excluding invalid files
Subdirectory 'R' contains invalid file names:
  'auk' 'auk.rdb' 'auk.rdx'
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
Removed empty directory 'auk/R'
Removed empty directory 'auk/man'
WARNING: Removing directory 'auk/Meta' which should only occur in an
  installed package
WARNING: Removing directory 'auk/help' which should only occur in an
  installed package
WARNING: Removing directory 'auk/html' which should only occur in an
  installed package
* looking to see if a 'data/datalist' file should be added
* building 'auk_0.0.2.tar.gz'

Setting env vars -----------------------------------------
_R_CHECK_CRAN_INCOMING_ : FALSE
_R_CHECK_FORCE_SUGGESTS_: FALSE
Checking auk ---------------------------------------------
"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD check  \
  "C:\Users\amf698\AppData\Local\Temp\RtmpOCSZT8/auk_0.0.2.tar.gz"  \
  --as-cran --timings --no-manual 

* using log directory 'C:/Users/amf698/AppData/Local/Temp/RtmpOCSZT8/auk.Rcheck'
* using R version 3.4.1 (2017-06-30)
* using platform: x86_64-w64-mingw32 (64-bit)
* using session charset: ISO8859-1
* using options '--no-manual --as-cran'
* checking for file 'auk/DESCRIPTION' ... OK
* this is package 'auk' version '0.0.2'
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'
* checking if this is a source package ... ERROR
Only *source* packages can be checked.
* DONE

Status: 1 ERROR, 1 NOTE
See
  'C:/Users/amf698/AppData/Local/Temp/RtmpOCSZT8/auk.Rcheck/00check.log'
for details.

R CMD check results
1 error  | 0 warnings | 1 note 
checking if this is a source package ... ERROR
Only *source* packages can be checked.

checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'

emhart commented 7 years ago

Package Review

Estimated hours spent reviewing: 5

Review Comments

The authors present a very elegant solution to a difficult problem in R, how to handle a very large data set such as the eBird data (larger than most people's personal computers could load into RAM) when most users only need a subset of the data in the full file. Their solution is to provide a way to automatically build an Awk script, execute it, and write a new output file. While this could be done without this R package, they make the data accessible to a much larger audience.
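As a rough illustration of the build, execute, and write workflow described above, the sketch below does the same three steps by hand. It is not the script auk generates: the three-column sample, its column order, and the variable names are all made up for the example, and the real EBD layout is different.

```shell
# Made-up three-column sample standing in for the much larger EBD text file.
ebd_in=$(mktemp)
ebd_out=$(mktemp)
printf 'common_name\tcountry\tobservation_date\n' > "$ebd_in"
printf 'Gray Jay\tCanada\t2012-06-01\n' >> "$ebd_in"
printf 'American Robin\tUnited States\t2012-06-02\n' >> "$ebd_in"
printf 'Gray Jay\tUnited States\t2012-06-03\n' >> "$ebd_in"

# Step 1: build the AWK program as a string from the desired filters.
# NR == 1 keeps the header row; the rest keeps only matching records.
species='Gray Jay'
country='Canada'
script='NR == 1 || ($1 == sp && $2 == co)'

# Step 2: execute it, streaming the input so it never has to fit in memory.
# Step 3: write the much smaller result to a new output file.
awk -F '\t' -v sp="$species" -v co="$country" "$script" "$ebd_in" > "$ebd_out"

cat "$ebd_out"
```

Because AWK processes the file one line at a time, memory use stays flat regardless of input size, which is what makes this approach workable for a file of more than 150 GB.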

Overall I found the code to be very comprehensive and elegant and think it will be a good addition to the rOpenSci package suite. The authors largely adhere to the rOpenSci package guidelines and are exceedingly diligent in their error handling in each function. I was also impressed with their coverage of use cases in their tests. They went above and beyond in writing an exhaustive suite of tests for each function. I find no major issues with how their code is written.

I do think there are a couple of minor areas for improvement in making it easier for end users. The biggest issue I had was initially grokking that this was a multi-step workflow that involved writing a file to disk. My first impression was that I could simply run a bunch of filters and the ebd variable (in the README) would actually be a dataframe. If there was a way to make the workflow more explicit, especially in the README, I think that would be helpful. Another thought I have is, would it be possible to hide this multi-step process and have a function that loads up the ebd, runs the filters, writes the file and reads it all into a tibble? That way the end user could side-step ever running read_ebd(). Another minor issue I had was that there's somewhat mixed handling of what I think of as "user standards laziness". For instance, you insist on ISO date standards, but countries can be mixed case, and don't require the ISO country code. I see how on the one hand you're making it easier for users, but I found myself a bit confused about where I could skip on my standards when it came to input. I was honestly surprised when I could enter "gray jay" but not "Robin" (but "American Robin" worked fine).

Minor comments

Community guidelines

Neither a CONTRIBUTING file nor a description of how to contribute in the README is present. Consider adding contributor guidelines.

Examples

I ran all examples using devtools::run_examples and all ran without error.

Tests

I ran all tests with devtools::test() and all tests were passed.

Checks

I built the package on the following system using devtools::check(cran = TRUE): R version 3.4.1 (2017-06-30), Platform: x86_64-apple-darwin15.6.0 (64-bit), Running under: macOS Sierra 10.12.6

All checks were passed with no notes, errors, or warnings.

Test coverage

I checked for the amount of test coverage using covr::package_coverage() and it was 80.9%

Furthermore, I reviewed all the tests in tests/testthat. Not only was there good test coverage, the range of scenarios was exhaustive. I was very impressed with the breadth of cases tested.

sessionInfo()

Just so you can see what versions of packages I used to run my tests:

Session info ----------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.0.153)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2017-08-23                  

Packages --------------------------------------------------------------------------------------------
 package      * version   date       source                                   
 assertthat     0.2.0     2017-04-11 CRAN (R 3.4.1)                           
 auk          * 0.0.2.901 <NA>       local                                    
 backports      1.1.0     2017-05-22 CRAN (R 3.4.1)                           
 base         * 3.4.1     2017-07-07 local                                    
 bindr          0.1       2016-11-13 CRAN (R 3.4.1)                           
 bindrcpp     * 0.2       2017-06-17 CRAN (R 3.4.1)                           
 callr          1.0.0     2016-06-18 CRAN (R 3.4.0)                           
 clisymbols     1.2.0     2017-08-24 Github (gaborcsardi/clisymbols@e49b4f5)  
 commonmark     1.2       2017-03-01 CRAN (R 3.4.1)                           
 compiler       3.4.1     2017-07-07 local                                    
 countrycode    0.19      2017-02-06 CRAN (R 3.4.0)                           
 covr         * 3.0.0     2017-06-26 CRAN (R 3.4.1)                           
 crayon         1.3.2     2016-06-28 CRAN (R 3.4.1)                           
 cyclocomp      1.1.0     2017-08-24 Github (MangoTheCat/cyclocomp@6156a12)   
 data.table     1.10.4    2017-02-01 CRAN (R 3.4.0)                           
 datasets     * 3.4.1     2017-07-07 local                                    
 desc           1.1.1     2017-08-03 CRAN (R 3.4.1)                           
 devtools     * 1.13.3    2017-08-02 CRAN (R 3.4.1)                           
 digest         0.6.12    2017-01-27 CRAN (R 3.4.1)                           
 dplyr          0.7.2     2017-07-20 CRAN (R 3.4.1)                           
 evaluate       0.10.1    2017-06-24 CRAN (R 3.4.1)                           
 glue           1.1.1     2017-06-21 CRAN (R 3.4.1)                           
 goodpractice   1.0.0     2017-08-24 Github (MangoTheCat/goodpractice@9969799)
 graphics     * 3.4.1     2017-07-07 local                                    
 grDevices    * 3.4.1     2017-07-07 local                                    
 hms            0.3       2016-11-22 CRAN (R 3.4.0)                           
 httr           1.3.1     2017-08-20 CRAN (R 3.4.1)                           
 igraph         1.1.2     2017-07-21 CRAN (R 3.4.1)                           
 jsonlite       1.5       2017-06-01 CRAN (R 3.4.1)                           
 knitr          1.17      2017-08-10 CRAN (R 3.4.1)                           
 lazyeval       0.2.0     2016-06-12 CRAN (R 3.4.0)                           
 lintr          1.0.1     2017-08-10 CRAN (R 3.4.1)                           
 magrittr       1.5       2014-11-22 CRAN (R 3.4.1)                           
 memoise        1.1.0     2017-04-21 CRAN (R 3.4.1)                           
 methods      * 3.4.1     2017-07-07 local                                    
 pkgconfig      2.0.1     2017-03-21 CRAN (R 3.4.1)                           
 praise         1.0.0     2015-08-11 CRAN (R 3.4.0)                           
 purrr          0.2.3     2017-08-02 CRAN (R 3.4.1)                           
 R6             2.2.2     2017-06-17 CRAN (R 3.4.1)                           
 rcmdcheck      1.2.1     2016-09-28 CRAN (R 3.4.0)                           
 Rcpp           0.12.12   2017-07-15 CRAN (R 3.4.1)                           
 readr          1.1.1     2017-05-16 CRAN (R 3.4.0)                           
 remotes        1.1.0     2017-07-09 CRAN (R 3.4.1)                           
 rex            1.1.1     2016-12-05 CRAN (R 3.4.0)                           
 rlang          0.1.2     2017-08-09 CRAN (R 3.4.1)                           
 roxygen2       6.0.1     2017-02-06 CRAN (R 3.4.1)                           
 rprojroot      1.2       2017-01-16 CRAN (R 3.4.1)                           
 rstudioapi     0.6       2016-06-27 CRAN (R 3.4.1)                           
 stats        * 3.4.1     2017-07-07 local                                    
 stringi        1.1.5     2017-04-07 CRAN (R 3.4.1)                           
 stringr        1.2.0     2017-02-18 CRAN (R 3.4.1)                           
 testthat     * 1.0.2     2016-04-23 CRAN (R 3.4.0)                           
 tibble         1.3.4     2017-08-22 CRAN (R 3.4.1)                           
 tidyr          0.7.0     2017-08-16 CRAN (R 3.4.1)                           
 tools          3.4.1     2017-07-07 local                                    
 utils        * 3.4.1     2017-07-07 local                                    
 whoami         1.1.1     2015-07-13 CRAN (R 3.4.0)                           
 withr          2.0.0     2017-07-28 CRAN (R 3.4.1)                           
 xml2           1.1.1     2017-01-24 CRAN (R 3.4.1)                           
 xmlparsedata   1.0.1     2016-06-18 CRAN (R 3.4.0)      

emhart commented 7 years ago

@mstrimas As an aside from my review I wanted to say that this is a really cool solution to a big problem in R that I actually encounter in my work often. I might have a dataset that's 20-30 GB and I don't want to actually crunch the whole thing in R. So I do a slightly more hacky approach which is to do some filtering in the data export phase (in SQL) and then some basic shell commands to sample / trim it down more, and then read it into R to do things like model POC. Do you think there's a way to make this package completely generic?

I'm imagining a scenario where I input the file location, column header names, a series of generic filters, and then the same basic workflow happens, awk executes the script and then writes an output file. Then this same workflow could work on any large text file. It seems like that would be a really powerful tool that would extend the functionality of this approach beyond eBird. Do you think that would be feasible?
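To make the idea concrete, here is a minimal sketch of such a generic filter, under stated assumptions: a tab-separated file with a header row, and exact-match filters written as `column=value`. The function name `filter_file`, the `|` separator used internally (so values must not contain `|`), and the sample data are all hypothetical. This is not how auk is implemented.

```shell
# Hypothetical generic filter: filter_file FILE 'column=value' ['column=value' ...]
# Prints the header plus every row that satisfies all filters.
filter_file() {
  file=$1
  shift
  # Join the filters into one string so they can be passed to awk with -v.
  spec=''
  for f in "$@"; do spec="${spec:+$spec|}$f"; done
  awk -F '\t' -v spec="$spec" '
    BEGIN {
      # Split "col=val|col=val" into parallel arrays of names and values.
      n = split(spec, parts, "|")
      for (k = 1; k <= n; k++) {
        eq = index(parts[k], "=")
        want_col[k] = substr(parts[k], 1, eq - 1)
        want_val[k] = substr(parts[k], eq + 1)
      }
    }
    NR == 1 {
      # Map header names to field positions and validate the filters.
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (k = 1; k <= n; k++)
        if (!(want_col[k] in pos)) {
          print "unknown column: " want_col[k] > "/dev/stderr"
          bad = 1
        }
      if (bad) exit 1
      print
      next
    }
    {
      # Skip the row as soon as any filter fails.
      for (k = 1; k <= n; k++)
        if ($(pos[want_col[k]]) != want_val[k]) next
      print
    }
  ' "$file"
}

# Demo with made-up data.
big_file=$(mktemp)
printf 'species\tcountry\tcount\n' > "$big_file"
printf 'Gray Jay\tCanada\t2\n' >> "$big_file"
printf 'American Robin\tCanada\t5\n' >> "$big_file"
printf 'Gray Jay\tUnited States\t1\n' >> "$big_file"

filter_file "$big_file" 'species=Gray Jay' 'country=Canada'
```

Looking columns up by header name rather than position is what lets the same function work on any delimited text file; range or pattern filters would need a richer filter syntax than this sketch provides.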

mstrimas commented 7 years ago

Thanks for all the helpful feedback! I'll start working through your suggestions and incorporating them.

@emhart yes, I think there is potential to make a more general AWK package for working with large files. In fact, I did originally consider doing that first, then making auk depend on the more general package, but just didn't have the time. There may also be better options than AWK that I'm not aware of... in any case, I think it would be useful to have a tool for processing text files that are too large to handle directly in R.

karthik commented 7 years ago

@aurielfournier A gentle ping 🙏

karthik commented 7 years ago

@aurielfournier Sorry I totally missed that your review was above Ted's. My apologies.

aurielfournier commented 7 years ago

No problem @karthik ! Our reviews came in within a few hours of each other, easy to miss. I appreciate the gentle reminder, those are often necessary to keep me on top of things.

mstrimas commented 7 years ago

Finally getting to this, here are responses to @aurielfournier's comments:

Thanks for the code review!!!

mstrimas commented 7 years ago

Here are my responses to @emhart:

Thanks!!!

mstrimas commented 7 years ago

@emhart Just added ability to manually set awk path by setting the AWK_PATH environment variable in .Renviron. Should work on Mac or Windows, though I don't have a Windows machine to test.
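A small sketch of one way to set this up on a Unix-like system: locate awk on the PATH and record it as an `AWK_PATH=` line. The only facts taken from the comment above are the variable name and the use of .Renviron; the demo writes to a temporary file, and in real use the target would be the user's own `~/.Renviron`.

```shell
# Demo target so the example does not touch a real config file.
# In real use this would be "$HOME/.Renviron".
renviron=$(mktemp)

# Find the awk executable that the shell would run.
awk_bin=$(command -v awk)

if [ -n "$awk_bin" ]; then
  # .Renviron entries are plain NAME=value lines, one per line.
  printf 'AWK_PATH=%s\n' "$awk_bin" >> "$renviron"
else
  echo "awk not found on PATH" >&2
fi

cat "$renviron"
```

R reads .Renviron at startup, so a new session is needed before the setting is visible to the package.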

aurielfournier commented 7 years ago

@mstrimas

The Quick Start is great, exactly what I was looking for.

I think one example is sufficient for the with and without pipes.

I see what you mean about auk_filter() now, I guess I could go either way on that one. Do you have thoughts @emhart ?

mstrimas commented 6 years ago

@aurielfournier I've added pipe-free examples to all functions for pipe haters.

@karthik what's the next step here?

The eBird taxonomy and EBD were just updated and my intention is to submit a new version of auk to CRAN in the next few days reflecting the taxonomy changes and the suggestions from the reviewers.

emhart commented 6 years ago

Sorry for the delay @mstrimas here are a few quick thoughts:

mstrimas commented 6 years ago

@emhart thanks! I like the idea of column subsetting and will start looking at the best way to implement that.

mstrimas commented 6 years ago

auk_filter() now has additional arguments keep and drop that users can use to specify which columns are output.
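For readers curious what column subsetting looks like at the AWK level, here is a minimal sketch of a "keep" operation on a tab-separated file with a header row. It only illustrates the general technique and makes no claim about how auk_filter() implements keep and drop; the function name `keep_columns` and the sample data are hypothetical.

```shell
# Hypothetical helper: keep_columns FILE COLUMN [COLUMN ...]
# Prints only the named columns, in the order requested.
keep_columns() {
  file=$1
  shift
  [ "$#" -gt 0 ] || { echo "at least one column name is required" >&2; return 1; }
  # Join the column names so they can be passed to awk with -v.
  keep=''
  for c in "$@"; do keep="${keep:+$keep|}$c"; done
  awk -F '\t' -v OFS='\t' -v keep="$keep" '
    NR == 1 {
      # Resolve each requested name to its field position in the header.
      n = split(keep, names, "|")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (k = 1; k <= n; k++) {
        if (!(names[k] in pos)) {
          print "unknown column: " names[k] > "/dev/stderr"
          exit 1
        }
        idx[k] = pos[names[k]]
      }
    }
    {
      # Rebuild each line (header included) from the selected fields only.
      line = $(idx[1])
      for (k = 2; k <= n; k++) line = line OFS $(idx[k])
      print line
    }
  ' "$file"
}

# Demo with made-up data.
cols_in=$(mktemp)
printf 'species\tcountry\tcount\n' > "$cols_in"
printf 'Gray Jay\tCanada\t2\n' >> "$cols_in"
printf 'American Robin\tUnited States\t5\n' >> "$cols_in"

keep_columns "$cols_in" count species
```

Dropping unneeded columns during extraction shrinks the output file further, which matters when the result still has to be read into R.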

mstrimas commented 6 years ago

@karthik just released a new version to CRAN with most of the changes suggested by the review process included, as well as a variety of other new features and bug fixes. let me know what the next steps are to get this up on rOpenSci. Thanks!

karthik commented 6 years ago

@emhart @aurielfournier Could you two take a look at the recent updates and let me know if you are ready to sign off? 🙏

aurielfournier commented 6 years ago

I will do my best to get to this in the next two weeks. Things are a bit swamped on my end at the moment.

karthik commented 6 years ago

Thank you @aurielfournier! much appreciated. 🙏

karthik commented 6 years ago

@emhart @aurielfournier gentle ping on signing off on this submission (or raising further issues). 🙏

aurielfournier commented 6 years ago

Oh crap. I'm so sorry. I'll do my best to get to this as soon as I can, but the federal shutdown is messing up my week majorly, and even if they do come back tomorrow it's going to take a few days for things to even out again.

aurielfournier commented 6 years ago

I am happy to sign off on this package. Great job @mstrimas

mstrimas commented 6 years ago

Thanks @aurielfournier! @karthik how would you suggest proceeding given that @emhart seems to be AWOL?

karthik commented 6 years ago

@mstrimas I've dropped Ted an email and will hopefully hear back from him soon. Sorry for the delays.

emhart commented 6 years ago

I took a look @mstrimas happy to sign off as well! Thanks for incorporating the feedback, the package looks great.

karthik commented 6 years ago

Congrats on your package being accepted @mstrimas! 🎉 🎈 And a huge thanks to @aurielfournier and @emhart for their expertise and time on this review! 🙏

Here are your next steps:

[![](https://badges.ropensci.org/136_status.svg)](https://github.com/ropensci/onboarding/issues/136)

Please also add a footer to the bottom of your README

[![](http://www.ropensci.org/public_images/github_footer.png)](http://ropensci.org)

Once moved, please re-run all checks in preparation for submission to CRAN. I can help with this if you run into any issues.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

mstrimas commented 6 years ago

Hi @karthik, in my original conversations with @noamross we agreed to host the package on the Cornell Lab of Ornithology's GitHub page and have a read-only mirror at rOpenSci. I've added the footer and badge to the README; however, our intention is to keep links and CI pointing to our organization's page. What's the best way to proceed with setting up a read-only mirror on rOpenSci? Thanks!

karthik commented 6 years ago

@mstrimas Hi Matt, understood. You can skip the transfer step and I'll look into the best way for setting up a mirror and get back to you with further details.

stefaniebutland commented 6 years ago

Hello @mstrimas. Congratulations on auk acceptance! We would love to host a post about it, so if you're interested, have a look at the editorial and technical info here https://github.com/ropensci/roweb2#contributing-a-blog-post and let me know if you are considering it.

mstrimas commented 6 years ago

@stefaniebutland sure, I'm happy to modify the vignette into a blog post. Probably won't be able to get to it for a week or two though.

stefaniebutland commented 6 years ago

No problem @mstrimas. I have Tues Feb 27 available for a post and we typically ask for a draft for review at least a week before the post date. What do you think about Tues Feb 20 to submit a draft via pull request?

mstrimas commented 6 years ago

That works for me, @stefaniebutland, thanks!

karthik commented 6 years ago

@mstrimas Just wanted to give you a quick update. We are still working out a good way to do the mirroring, but I'll let you know soon. Also I am about to travel for a bit, so I'll update the thread upon my return (late Feb). 🙏

mstrimas commented 6 years ago

Hi @stefaniebutland, looks like there's going to be a large update to the data underlying this package in mid March that will require some changes to the package that break backward compatibility. Would you be open to pushing the blog post back until the new version is released? If this messes things up for you, no worries, I can proceed with the post as is and just avoid the features that will get broken.

stefaniebutland commented 6 years ago

@mstrimas It's whatever you think is best. Blog post timing is flexible. You only really get the audience once for this kind of thing though, so if you prefer to publish after updates to avoid frustrating people once they're engaged, then we can postpone. Perhaps you can draft your post ideas for yourself now before they go stale ;-) and then fill in soon after you've made the required changes to the package.

I'll mark my calendar to check in with you in late March.

mstrimas commented 6 years ago

Ok, that's what I'm thinking, only one chance to catch people's eye so better to have the package in tip top shape. Late March sounds good. Thanks!

stefaniebutland commented 6 years ago

@mstrimas

> looks like there's going to be a large update to the data underlying this package in mid March

Checking in to see if the timing is right to draft a blog post. No rush if the pkg isn't updated yet.

maelle commented 6 years ago

By the way, it'd be grand if the blog post explained a bit how to choose between using auk and @sckott's rebird depending on the use case. 😺 The information could also be in the READMEs of both packages. Thinking of this because this week I was at a loss which of the two to recommend. 😀

mstrimas commented 6 years ago

@stefaniebutland I think the package is ready, I can start putting something together this week. Thanks for the reminder!

@maelle rebird is an interface to the eBird API, which gives access to a very limited subset of the data, e.g. the last 30 days of observations from a location. I think of rebird as being useful for building tools and visualizations for birders; however, for most ecological applications (e.g. distribution modeling) you'll want access to the full eBird database (~500 million records).

maelle commented 6 years ago

Thanks a lot for the explanations @mstrimas! It'd be a nice footnote as well in my opinion (of the post and READMEs).

When you say 30 days of observation you mean for raw occurrence data right? For frequency derived from it it seems you can get older data e.g. https://github.com/stephhazlitt/ruhu-ebird-observations/blob/master/R/ruhu-ebird-observations.md

mstrimas commented 6 years ago

@maelle I wasn't aware of the ebirdfreq() function, that's cool! Seems all the other functions are "recent" observations, but that one does give access to historical data at state, county, and hotspot level. It's also worth noting that rebird is easier to use and much faster, so if your data needs can be met by rebird, I'd say it's definitely preferred.

I'll add something to the README explaining the difference, thanks for the suggestion!

maelle commented 6 years ago

Awesome! It'll be super useful to guide users whichever of the two packages they find first! I wonder if the info should also live in the vignette because of people installing from CRAN and therefore not having the README 🤔

mstrimas commented 6 years ago

@maelle updated the README and vignette as per your suggestion

maelle commented 6 years ago

Fantastic! Speaking of other rOpenSci packages, I am also wondering whether/how one could use bowerbird (not an ornithology package despite its name) and auk to keep, update and use a local copy of the eBird dataset. I might ping you if I ever try to write such a use case.

mstrimas commented 6 years ago

@stefaniebutland here's a first draft of a blog post.

What topicid and date should I use? Also, is there somewhere in the website repo I can put a couple of data files (~ 3 MB)? If there isn't a good spot, I'll just leave them in my GitHub repo.

If this looks good I can submit a pull request to the rOpenSci website repo.