Missing values: author1, repourl, submission-type
I fixed the submission (HTML comments were missing, they're needed for the bot @cboettig :slightly_smiling_face:).
@ropensci-review-bot check package
Thanks, about to send the query.
:rocket:
Editor check started
:wave:
git hash: 46b2cb4f
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 4 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 4| 23.3| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 52| 3.1|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 24| 12.8| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 12.6| |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 31| 11.0| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 38| 67.4| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---
Click to see the interactive network visualisation of calls between objects in package
`goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![https://github](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)](https://github)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |46b2cb |2021-11-30 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 53.12

The following files are not completely covered by tests:

file | coverage
--- | ---
R/gbif_download.R | 0%
R/gbif_remote.R | 60%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 1 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 1
|package |version |
|:--------|:---------|
|pkgstats |0.0.3.52 |
|pkgcheck |0.0.2.185 |
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
Thanks! I think most of these issues are resolved in the repo now (e.g. by commit https://github.com/cboettig/gbifdb/commit/46b2cb4fb0b180a767c8f091d00533d024d3ea3c or prior).
Coverage is still less than 75% because I do not test `gbif_download()`. That function downloads over 100 GB of data via AWS sync. It is a thin wrapper around `aws.s3::s3sync()`, and I don't think anything is gained by a mocking test, but happy to be convinced otherwise.
there's no CITATION file still since there's no publication to point to, but there's now a codemeta.json file :see_no_evil: .
please let me know if there's anything else I ought to address first!
@ropensci-review-bot check package
Thanks, about to send the query.
:rocket:
Editor check started
:wave:
git hash: 015b74e7
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 55| 3.4|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 35| 17.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 0.0|TRUE |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 18| 6.8| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 34| 69.2| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---
Click to see the interactive network visualisation of calls between objects in package
`goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![https://github](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)](https://github)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |failure |015b74 |2021-12-02 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following error:

1. checking tests ...
   Running ‘spelling.R’
   Comparing ‘spelling.Rout’ to ‘spelling.Rout.save’ ... OK
   Running ‘testthat.R’ ERROR
   Running the tests in ‘tests/testthat.R’ failed.
   Last 13 lines of output:
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
    ── Error (test_gbifdb.R:39:3): gbif_remote(to_duckdb=TRUE)....slow! ────────────
    Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
    Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = TRUE) test_gbifdb.R:39:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
    [ FAIL 2 | WARN 1 | SKIP 0 | PASS 4 ]
    Error: Test failures
    Execution halted

R CMD check generated the following test_fail:

1. > library(testthat)
   > library(gbifdb)
   >
   > test_check("gbifdb")
   See arrow_info() for available features

   Attaching package: 'arrow'
   The following object is masked from 'package:testthat':
       matches
   The following object is masked from 'package:utils':
       timestamp

   Attaching package: 'dplyr'
   The following object is masked from 'package:testthat':
       matches
   The following objects are masked from 'package:stats':
       filter, lag
   The following objects are masked from 'package:base':
       intersect, setdiff, setequal, union

   ══ Warnings ════════════════════════════════════════════════════════════════════
   ── Warning (test_gbifdb.R:1:1): (code run outside of `test_that()`) ────────────
   `context()` was deprecated in the 3rd edition.
   Backtrace:
    1. testthat::context("test gbifdb") test_gbifdb.R:1:0
    2. testthat:::edition_deprecate(3, "context()")

   ══ Failed tests ════════════════════════════════════════════════════════════════
   ── Error (test_gbifdb.R:30:3): gbif_remote() ───────────────────────────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
    █
    1. └─gbifdb::gbif_remote(to_duckdb = FALSE) test_gbifdb.R:30:2
    2. └─arrow::s3_bucket(bucket, ...)
    3. └─FileSystem$from_uri(bucket)
    4. └─arrow:::fs___FileSystemFromUri(uri)
   ── Error (test_gbifdb.R:39:3): gbif_remote(to_duckdb=TRUE)....slow! ────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
    █
    1. └─gbifdb::gbif_remote(to_duckdb = TRUE) test_gbifdb.R:39:2
    2. └─arrow::s3_bucket(bucket, ...)
    3. └─FileSystem$from_uri(bucket)
    4. └─arrow:::fs___FileSystemFromUri(uri)
   [ FAIL 2 | WARN 1 | SKIP 0 | PASS 4 ]
   Error: Test failures
   Execution halted

R CMD check generated the following check_fail:

1. rcmdcheck_tests_pass

#### Test coverage with [covr](https://covr.r-lib.org/)

ERROR: Test Coverage Failed

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!
|package |version |
|:--------|:---------|
|pkgstats |0.0.3.54 |
|pkgcheck |0.0.2.187 |
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
@cboettig This is the first submission with an `arrow` dependency, for which the standard installation of system libraries is inadequate. I'll rebuild the check system with a pre-install of arrow, and your package will then at least pass R CMD check. Thanks for the good test case!
@ropensci-review-bot assign @jhollist as editor
Assigned! @jhollist is now the editor
Thanks @mpadge, but it's good to discover these errors now! I forgot that `arrow` could optionally be built without S3 support. I've added checks so the tests will skip if `arrow::arrow_info()` reports that arrow was built without S3 capability. (No doubt CRAN has at least one machine where that is also the case. Realistically I will need to skip that test on CRAN too, I think; the operation is too network-intensive to be robust, and again I don't think there's much value in creating a mock test that passes even when the network is unavailable.)
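For reference, the skip pattern looks roughly like this (a sketch only; the helper name `skip_if_no_arrow_s3()` is made up here, not necessarily the exact code in the repo):

```r
# Hypothetical helper: skip S3-dependent tests when arrow lacks S3 support,
# and skip on CRAN where network-heavy tests are not appropriate.
skip_if_no_arrow_s3 <- function() {
  testthat::skip_on_cran()
  has_s3 <- isTRUE(arrow::arrow_info()$capabilities[["s3"]])
  if (!has_s3) testthat::skip("arrow was built without S3 support")
}

test_that("gbif_remote() returns a remote table", {
  skip_if_no_arrow_s3()
  gbif <- gbif_remote()
  expect_s3_class(dplyr::collect(head(gbif)), "data.frame")
})
```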
Hi, @cboettig! Looking forward to working on this with you! I am full up today, but wanted to let you know that I have this on my radar and will start working on it early next week.
@mpadge Have you had a chance to add arrow to the check system? I want to hold off running the package checks until I know that is ready to roll.
Yep @jhollist it's up and rolling with arrow pre-compiled for S3 support. You may check away ...
@ropensci-review-bot check package
Thanks, about to send the query.
:rocket:
Editor check started
:wave:
git hash: 2c58f8e7
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 56| 3.5|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 43| 20.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 0.0|TRUE |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 20| 7.5| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 44| 72.6| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---
Click to see the interactive network visualisation of calls between objects in package
`goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![https://github](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)](https://github)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |2c58f8 |2021-12-04 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 66.67

The following files are not completely covered by tests:

file | coverage
--- | ---
R/gbif_download.R | 0%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!
|package |version |
|:--------|:---------|
|pkgstats |0.0.3.54 |
|pkgcheck |0.0.2.195 |
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
@jhollist Thanks much!
I've added a `contributing.md` (via `usethis::use_tidy_contributing()`); apologies for missing that one.
As mentioned above, I don't think it makes sense to add a CITATION file at this stage. (R will already do the best option via `citation()`, citeAs will recognize the codemeta.json (and I think the DESCRIPTION), and Zenodo will, I think, continue to ignore all of these and go by the GitHub metadata or its own .zenodo.json override instead.) I love metadata, but I'm not convinced data entry is improved by so much duplication at this stage. If this ever has an external doc or paper to cite, I'll certainly add that to CITATION at that time.
Also, I'm not sure I can do much about code coverage without resorting to silly hacks. (I totally appreciate the spirit of the rule, but I'm never fond of putting too much emphasis on such statistics: they become relatively easy to game, which only further erodes their utility.)
I think we're clear on the other checks? Cool auto-check system! Let me know what's next!
@cboettig A couple of items:
Go ahead and still add the CITATION file (see: https://devguide.ropensci.org/building.html?q=citatio#citation-file). Having this in there will allow you to track any Zenodo DOIs that get minted (although not necessary right now). Also, it might be useful to provide a citation for the GBIF data/AWS service.
I agree with you on the testing. I think you have a fairly complete test suite. I will probably ask the reviewers to think about this one as well, just to make sure we aren't missing something. That being said, nothing to do right now. I am happy to send on to reviewers with the current test suite.
I can't get the dev version of duckdb installed, but it looks like the version on CRAN is newer than what you require in the DESCRIPTION.
When I run `gbif_remote()` I get:

Error: IOError: When getting information for key 'occurrence/2021-11-01/occurrence.parquet' in bucket 'gbif-open-data-af-south-1': AWS Error [code 15]: No response body.

I tried with a different version and a different bucket, but it did not work. Is this my machine (session info below) or do you see the same thing?
So next steps are to get the CITATION file in, double-check whether users need the dev version of duckdb or if CRAN is fine, and make sure `gbif_remote()` is working. After that I have some other local checks to do, then I can pass it on to reviewers.
Make sense?
My Session Info:
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.7 gbifdb_0.1.0
loaded via a namespace (and not attached):
[1] arrow_6.0.1 fansi_0.5.0 assertthat_0.2.1 utf8_1.2.2 crayon_1.4.2 R6_2.5.1
[7] DBI_1.1.1 lifecycle_1.0.1 magrittr_2.0.1 pillar_1.6.4 rlang_0.4.12 remotes_2.4.1
[13] vctrs_0.3.8 generics_0.1.1 ellipsis_0.3.2 tools_4.1.0 bit64_4.0.5 glue_1.5.1
[19] bit_4.0.4 purrr_0.3.4 compiler_4.1.0 pkgconfig_2.0.3 tidyselect_1.1.1 tibble_3.1.6
Thanks @jhollist! That makes perfect sense. I've added a CITATION now, based on the GBIF citation guidelines for the AWS bucket.
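For anyone curious what that involves, a minimal CITATION entry of this kind could look roughly like the following (an illustrative sketch only, not the actual file contents; fields and wording are assumptions):

```r
# inst/CITATION (illustrative sketch, not the actual gbifdb file)
citHeader("To cite the GBIF occurrence snapshot accessed via gbifdb, please use:")

bibentry(
  bibtype = "Misc",
  title   = "GBIF occurrence data on the AWS Open Data Registry",
  author  = person("GBIF.org"),
  year    = "2021",
  url     = "https://registry.opendata.aws/gbif/",
  note    = "Follow GBIF citation guidelines and cite the specific snapshot used."
)
```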
Hmm... I can't replicate the error you see. Note that `gbif_remote()` is included in the test suite and appears to be passing on those OSs (https://github.com/cboettig/gbifdb/actions/runs/1546475635) as well as in the automated checks above... :thinking: I've tested locally on my dinky Windows tablet with 4 GB of RAM on my crappy home network and it works there as well...
It looks to me like a network error though -- packets lost in transmission or something. You've confirmed it's repeatable? I've tried to get others to test this remote arrow connection, I think there were a few examples where someone had trouble but then it worked? I haven't been able to find a reproducible error though on any system I can access or get someone to test... How do we proceed?
I would probably recommend users fall back on the local download for high-throughput use, or for cases where network bandwidth is too slow for network-based interactive use, which is why the package provides both options.
I can confirm that it is me, not you! Something with my work machine. I checked your actions after I sent my response and saw that they worked... I was able to get it to run on my son's laptop! I will sequester his machine and do the additional checks. I'll try to get through that tomorrow.
I'll also dig in a bit more on my machine and see if I can get it working. Stay tuned!
@cboettig I figured it out. I had an existing aws credentials file that was likely expired. I think authentication was failing and that was causing problems. Removed my old credentials file and tried again. Everything working! Will finish with checks on this tomorrow.
@jhollist brilliant detective work! :detective: I had forgotten about credentials! Actually, I'd seen that same problem before with the aws.s3 download, which I believe fails even with valid credentials because this is a public bucket; you can see I override them like so: https://github.com/cboettig/gbifdb/blob/main/R/gbif_download.R#L33

(Not sure that's the best option really; credentials in the `~/.aws` dir could also get in the way, I think.) Anyway, this would be a great issue to hash out with reviewer input! Meanwhile I can at least add the issue to the function documentation.
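For anyone hitting the same thing, the workaround I'm describing amounts to something like this (a sketch of the idea only, not necessarily the exact code at the link above):

```r
# Blank out any locally configured AWS credentials so that requests to the
# public GBIF bucket are made anonymously rather than failing on stale keys.
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "",
  AWS_SECRET_ACCESS_KEY = "",
  AWS_SESSION_TOKEN     = ""
)
```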
@ropensci-review-bot check package
Thanks, about to send the query.
:rocket:
Editor check started
:wave:
git hash: 3cf5429c
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 10 lines of code)
- 7 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 68| 4.5|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 43| 20.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 12| 9.0| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 7| 7.8| |
|n_fns_per_file_r | 1| 11.5| |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 43.9| |
|loc_per_fn_r_exp | 10| 23.4| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 25| 11.8| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 44| 72.6| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 4| 9.7| |

---
Click to see the interactive network visualisation of calls between objects in package
`goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![https://github](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)](https://github)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |3cf542 |2021-12-07 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 70.73

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!
|package |version |
|:--------|:---------|
|pkgstats |0.0.3.56 |
|pkgcheck |0.0.2.195 |
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
There is some overlap with `rgbif`. You make the case that the difference is generality and performance. I am not too worried about this, but I will ask the reviewers to think about it, and would ask you to think a bit more on what distinguishes `gbifdb` from `rgbif`. At a minimum, the case you make in the DESCRIPTION would be useful to have in the README as well.

I will start working on finding reviewers. Looking forward to working on this!
@ropensci-review-bot assign @Pakillo as reviewer
@Pakillo added to the reviewers list. Review due date is 2021-12-29. Thanks @Pakillo for accepting to review! Please refer to our reviewer guide.
@Pakillo: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot assign @stephhazlitt as reviewer
@stephhazlitt added to the reviewers list. Review due date is 2022-01-03. Thanks @stephhazlitt for accepting to review! Please refer to our reviewer guide.
@stephhazlitt: If you haven't done so, please fill this form for us to update our reviewers records.
@cboettig Given the time of year, I was planning on giving the reviewers a bit more time to complete the reviews. Would you be opposed to me setting the review due dates to Jan 10?
@jhollist certainly! :snowman_with_snow:
Thanks @cboettig!
@stephhazlitt and @Pakillo, note the extension on the due date. I set them for 2022-01-10.
@stephhazlitt and @Pakillo just a quick follow up to see how the reviews are coming along. Just holler if you need a bit more time.
Thank you for the opportunity to test drive and review this great new R package. {gbifdb} brings together many things near and dear to me -- easier access to lots and lots of open biodiversity data 🦋🦉🦡, R, and leveraging new open source tools like Arrow 🏹 and DuckDB 🦆 to make all of this faster and possible on a "regular" laptop (for my review I used a MacBook Pro M1 2020). I hope this feedback is helpful, please feel free to reach out if there are follow up questions or if there is anything I can do to support moving forward.
The package includes all the following forms of documentation:
- `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 5ish
Connecting with `gbif_remote()` worked out of the box 🎉, however the example query below eventually timed out:

library(gbifdb)
library(dplyr)
library(tictoc)
gbif_s3 <- gbif_remote()
tic()
df <- gbif_s3 |>
filter(species == "Haematopus bachmani",
year > 2018,
stateprovince == "British Columbia") |>
count(year) |>
collect()
toc()
## Error: IOError: AWS Error [code 99]: curlCode: 28, Timeout was reached
## 6098.66 sec elapsed
I used `colnames(gbif)` in the local data section of the README to get familiar with the GBIF data first. I think it would be useful to provide new (GBIF) users with the full name, the URL, and a couple more sentences on GBIF, what data is available (and what is not), and the easiest way to access metadata.

`gbif_download()` works great 🎉 and provided some data to work with (note I deleted the last, partially downloaded parquet file). I needed to append `/occurrence.parquet/` to the end of that path for `gbif_conn()`; it might be useful to add an example of this approach to the documentation:

gbif_download(dir = "/Users/stephhazlitt/gbif-data/")
gbif_dir <- "/Users/stephhazlitt/gbif-data/occurrence.parquet/"
conn <- gbif_conn(gbif_dir)
gbif <- tbl(conn, "gbif")
gbif
I tried both the default and `to_duckdb = TRUE` pathways. It looks like the docs were written virtually at the same time as the {arrow} 6.0.0 release, which brought a lot of added {dplyr} support to {arrow}, making this a reasonable alternative approach, at least with the amount of data I downloaded and the example data.

With the example data, both `gbif_conn()` + `tbl()` and `arrow::open_dataset()` worked:

library(gbifdb)
library(dplyr)
library(arrow)
library(ggplot2)
gbif_parquet <- gbif_example_data()
## database
conn_example <- gbif_conn(gbif_parquet)
gbif_example <- tbl(conn_example, "gbif")
gbif_example |>
group_by(order, phylum) |>
summarise(n_species = n_distinct(species)) |>
collect() |>
ggplot() +
geom_col(aes(n_species, order, fill = phylum))
## arrow
open_dataset(gbif_parquet) |>
group_by(order, phylum) |>
summarise(n_species = n_distinct(species)) |>
collect() |>
ggplot() +
geom_col(aes(n_species, order, fill = phylum))
- ... in a `.md` file or in the README.md file?
- Showing how `gbifdb` works for users is always nice in my experience.

Thanks @stephhazlitt, this is super :tada:! One design question you bring up which I'd really like advice on:
I completely hear you that it's awkward that `gbif_conn()` returns a connection to a database, while `gbif_remote()` returns a remote table; i.e. in the former we need to remember to use `dplyr::tbl()`, but with the arrow-based interfaces we don't. This kind of begs for a `gbif_local()`, which would just be `tbl(gbif_conn(), "gbif")`. So, Q1: is it worth making `dplyr` a hard dependency in order to add `gbif_local()`?
Currently, `dplyr` is only a soft dependency, since presumably some applications or users will just want the DBI connection object and are happy querying in SQL (some users/apps know SQL well and may not like having dplyr do it for them). I think of `dplyr` as a heavy dependency, but it's also a common one, so maybe I should just bite that bullet?
Q2: I'm also obviously on the fence as to whether `gbif_remote()` should just go ahead and call `to_duckdb()` or not. Doing so would make the interface a bit more consistent between the two connection types. OTOH, `to_duckdb()` can actually slow things down a bit at present when doing queries that arrow already knows how to do, which, as you noted, is quite a lot in `arrow` 6.0.1. Thoughts on either of these?
I totally hear you on the timing and network issues. I'll add some comments in the README and function docs. One interesting case to mention might be the increasingly popular cloud-hosted computing services, like https://rstudio.cloud. Similarly, I teach my courses using RStudio instances Berkeley provides that are similarly dinky in RAM and CPU, but because they are hosted on Google Cloud they have access to good bandwidth. I think AWS EC2 instances in the same region as the GBIF bucket would see particularly impressive performance, comparable to the local download, but almost all cloud providers have rather fast network performance, even on instances that have less RAM or CPU power than a dinky laptop.
Benchmarking network issues is tricky though, or at least I've never been properly trained in how to do it! Maybe some folks from the arrow team could advise me on that. With colleagues over at GBIF, we put together this draft blog post very recently: https://github.com/gbif/data-blog/blob/master/content/post/2022-01-04-make-big-queries-using-GBIF-parquet-snapshots.md. My local laptop is a crappy Windows Go tablet that struggles to run R, but I log in remotely to an RStudio session on my desktop at work, where we actually have gigabit ethernet, so I see runtimes that are almost an order of magnitude faster than John reports for remote connections, and fewer connection issues.
Also, good point about the laptop falling asleep on the whole-data download; I can add notes to the docs about that too. It's about a 100 GB download. Google tells me the average download speed in the US is over 100 MB/s, but that sounds ridiculous; say you get 10 MB/s downloads, it ought to take about 2.7 hours I guess (100,000 MB / 10 MB/s ≈ 10,000 s).
It may also be worth noting that while 100 GB isn't all that large for a laptop, it's much bigger than the disk most typical cloud instances provide, so `gbif_remote()` may be the only option there too.
Great questions.
I am biased towards a fully R workflow, so not sure my thoughts are representative of your user community, however I would go with the {dplyr} dependency and aim for a symmetrical 2-function per GBIF data location design:
- `gbif_remote()` - as is, arrow table connection
- `gbif_remote_db()` - wrapper for the above with `to_duckdb = TRUE`
- `gbif_local()` - wrapper for `arrow::open_dataset()` to read locally stored parquet data
- `gbif_local_db()` - for `tbl(gbif_conn(), "gbif")`
My thinking is that this brings maximum flexibility for users and the package, given tools and dependencies are all changing and improving so quickly.
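To make the proposal concrete, a rough sketch of the wrappers (names and signatures are illustrative only; `gbif_remote()` and `gbif_conn()` are the existing functions):

```r
gbif_remote_db <- function(...) {
  # remote data, surfaced through duckdb rather than a raw arrow table
  gbif_remote(..., to_duckdb = TRUE)
}

gbif_local <- function(dir) {
  # locally downloaded parquet snapshot, read with arrow
  arrow::open_dataset(dir)
}

gbif_local_db <- function(dir) {
  # locally downloaded parquet snapshot, exposed as a duckdb-backed dplyr table
  dplyr::tbl(gbif_conn(dir), "gbif")
}
```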
If you really wanted to provide a lightweight version for non-{dplyr} users, perhaps you could make a one function package for gbif_conn()
and then use that package as a dependency in {gbifdb}? I am channeling learnings from the {bcdata} experience, where for sure if we had a redo it would be more than one package.
Nice, I like it! Just hashing out function names a bit more here:
I think going from two functions (`gbif_remote()` and `gbif_local()`) to four seems a bit of overkill. I imagine the end user doesn't care so much about whether it's duckdb vs arrow at the backend (particularly when functions like `to_duckdb()` make an arrow connection use duckdb or vice versa). In particular, I'm not sure what we achieve by having the two different flavors of `gbif_local()` / `gbif_local_db()`. Sure, the duckdb-based one supports nearly the full dbplyr functionality while arrow doesn't do windowed operations (at least not yet), but that seems a more minor difference. (Technically, `duckdb` can access remote parquet files directly without `arrow`, which means we could drop the `arrow` dependency and `gbif_remote()` would match `gbif_local()` use for all cases without needing `to_duckdb()` to use windowed operations... but so far it's not enabled by default in the R client: https://github.com/duckdb/duckdb/issues/2177)
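For context, the duckdb-direct route would look something like the SQL below, run through DBI; this is a sketch that assumes a duckdb build where the `httpfs` extension can actually be installed (which, as noted, is not the default in the R client right now), and the bucket path is illustrative:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# httpfs lets duckdb read parquet straight from S3/HTTP (assumed available here)
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_region='us-east-1'")

dbGetQuery(con, "
  SELECT count(*)
  FROM parquet_scan('s3://gbif-open-data-us-east-1/occurrence/2021-11-01/occurrence.parquet/*')
")
```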
Re audience, I'm with you on a fully R workflow; I just meant that there's probably a subset of old-school R users out there who have been using the `DBI` R package since before dplyr/dbplyr and are probably happy to keep doing so. But I can keep `gbif_conn()` as a function for them and still make `dplyr` a hard dependency, I think. (I'm not sure, but this might mean `dbplyr` should be a dependency too?) The two-package idea is a good one, but I think it's overkill here since this is already a thin wrapper around `duckdb` and `arrow`.
Hi, apologies for the delay. Still playing with it. Hope to submit my review within a couple of days
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
- `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 12
Thanks a lot for the invitation to review this package. I hesitated to accept the invitation at first. First, I doubt I can ever improve anything @cboettig has written. Second, although I have some modest experience using databases with `dplyr`, I have never used `arrow` or `duckdb`. But I have learnt a lot reviewing `gbifdb` and, as a frequent GBIF user, my own work will benefit directly from this exercise (and the package itself, of course). I do hope my review is useful, at least from the perspective of an intermediate R user keen to start using the package.
The README/vignette is already very informative and nice. Some suggestions for improving it:
I think it would be good to mention somewhere that this GBIF dataset is a snapshot of "all CC0 and CC-BY licensed data in GBIF that have coordinates which passed automated quality checks", as explained here. From there I understand that occurrence records without coordinates will not be included in such dataset. I think some future users may be tricked by this, e.g. if they query number of records for some species over time, ignoring that records without coordinates will be missing.
"gbifdb
has few dependencies: only duckdb and DBI are required". But the DESCRIPTION file also imports arrow
?
"gbifdb currently requires the dev version of duckdb". Is this still correct? I installed the latest duckdb
version from GitHub as suggested (version 0.3.2). Yet the DESCRIPTION file specifies duckdb (>= 0.2.9). And CRAN already has version 0.3.1-1. Installing duckdb from github took quite a few minutes in my Ubuntu laptop, so if the CRAN version is already good, it would nice to update the README instructions accordingly. Otherwise update the version requirement in DESCRIPTION Imports?
From our tests, and also your post on the GBIF data blog, it looks like `gbifdb` can be much, much slower than `rgbif` for certain queries, e.g. retrieving occurrence records for a single species. I did not expect such a striking difference at first. I think it would be good to warn users about this already in the README. That is, for some use cases (like retrieving records for a single or a few species or countries), it may be better to use `rgbif` than `gbifdb`, right?
From my tests, it also looks like choosing the closest AWS region can speed up queries (and downloading) considerably. Hence, it might be nice to mention this already in the README, e.g. when first showing the use of `gbif_remote`, including an example with a different region than the default (also, is "af-south-1" the best default?). Otherwise many people may overlook this and wait unnecessarily long (happened to me at first). The help for `gbif_remote` could also be more informative about how to choose the best region in each case (e.g. see https://github.com/Pakillo/gbifdb/commit/e34d465fd97e45515f61d33ff9f90dda5a2f001f). A hypothetical region-override call is sketched after the `collect()` example below.
When showing the use of `gbif_remote`, it might be useful to include an entire example using `collect`, for those users not so familiar with databases, e.g.
gbif %>%
filter(species == "Abies pinsapo",
countrycode == "ES",
year > 2000) %>%
count(year) %>%
collect()
# A tibble: 13 × 2
year n
<int> <int>
1 2017 4
2 2020 44
3 2018 24
4 2019 61
5 2016 6
6 2021 7
7 2006 5
8 2004 1
9 2015 3
10 2014 2
11 2013 2
12 2009 1
13 2010 1
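On the region point above, a hypothetical call with a non-default bucket might look like this (assuming `gbif_remote` exposes a `bucket` argument analogous to `gbif_download`; I have not confirmed the argument name, so treat this as illustrative):

```r
# Use the European mirror of the GBIF snapshot rather than the default bucket
gbif_eu <- gbif_remote(bucket = "gbif-open-data-eu-central-1")
```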
`gbif_download` looks really handy, but I struggled a bit to download the entire database, even using a good (fiber) internet connection.
This is the code I used to download:
gbif_download(dir = "/media/frs/Elements/gbifdb/",
region = "eu-central-1",
bucket = "gbif-open-data-eu-central-1")
My first attempt failed within 20 minutes:
# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/000129' to '/media/frs/Elements/gbifdb///occurrence.parquet/000129'
# Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
# Timeout was reached: [s3-eu-central-1.amazonaws.com] Resolving timed out after 10000 milliseconds
A second attempt gave an error after a while:
# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/001044' to '/media/frs/Elements/gbifdb///occurrence.parquet/001044'
# 131 local files not found in bucket 'gbif-open-data-eu-central-1'
# List of 4
# $ Code : chr "AccessDenied"
# $ Message : chr "Access Denied"
# $ RequestId: chr "CA2X7EK3FCXH7RP2"
# $ HostId : chr "9h1TicaLFACsutywE7S7etD/yuYckCyxHKbnDfL7zd7hJv2PTCqPxkV8m7gwtxKCtWPCnQEPca0="
# - attr(*, "headers")=List of 7
# ..$ x-amz-request-id : chr "CA2X7EK3FCXH7RP2"
# ..$ x-amz-id-2 : chr "9h1TicaLFACsutywE7S7etD/yuYckCyxHKbnDfL7zd7hJv2PTCqPxkV8m7gwtxKCtWPCnQEPca0="
# ..$ content-type : chr "application/xml"
# ..$ transfer-encoding: chr "chunked"
# ..$ date : chr "Thu, 13 Jan 2022 11:22:45 GMT"
# ..$ server : chr "AmazonS3"
# ..$ connection : chr "close"
# ..- attr(*, "class")= chr [1:2] "insensitive" "list"
# - attr(*, "class")= chr "aws_error"
# NULL
# Error in parse_aws_s3_response(r, Sig, verbose = verbose) :
# Forbidden (HTTP 403).
Third attempt also gave similar error, even though it looks like all the files had been successfully downloaded (1045 parquet files plus the citation file):
# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/001044' to '/media/frs/Elements/gbifdb///occurrence.parquet/001044'
# 1046 local files not found in bucket 'gbif-open-data-eu-central-1'
# List of 4
# $ Code : chr "AccessDenied"
# $ Message : chr "Access Denied"
# $ RequestId: chr "21VETY51D6FQ6K3A"
# $ HostId : chr "couGo2j17BhB5a8k3GFjI8gIrO3akAbRzCaikOOk30sEIr7d4nCwBLcZRdA0wlYAwGF969q2KZ4="
# - attr(*, "headers")=List of 7
# ..$ x-amz-request-id : chr "21VETY51D6FQ6K3A"
# ..$ x-amz-id-2 : chr "couGo2j17BhB5a8k3GFjI8gIrO3akAbRzCaikOOk30sEIr7d4nCwBLcZRdA0wlYAwGF969q2KZ4="
# ..$ content-type : chr "application/xml"
# ..$ transfer-encoding: chr "chunked"
# ..$ date : chr "Thu, 13 Jan 2022 13:55:31 GMT"
# ..$ server : chr "AmazonS3"
# ..$ connection : chr "close"
# ..- attr(*, "class")= chr [1:2] "insensitive" "list"
# - attr(*, "class")= chr "aws_error"
# NULL
# Error in parse_aws_s3_response(r, Sig, verbose = verbose) :
# Forbidden (HTTP 403).
In general, asking from ignorance, is there any way to make the download easier and less fragile for the user? Also, it seems the download starts from scratch every time you call `gbif_download` (that is, files already on disk are replaced again, even if requesting the same GBIF snapshot)?
`gbif_download` documentation may be improved a bit, e.g. see https://github.com/Pakillo/gbifdb/commit/0ddc505b4626e158871643fafe991a795d3c77a6.

I also tried `gbif_conn` with a user-provided dir. My first attempt failed:

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/")
Error in .local(conn, statement, ...) :
duckdb_prepare_R: Failed to prepare query CREATE VIEW 'gbif' AS SELECT * FROM parquet_scan('/media/frs/Elements/gbifdb//*');
Error: Invalid Input Error: No magic bytes found at end of file '/media/frs/Elements/gbifdb/citation.txt'
You have to add "occurrence.parquet" for it to work:
con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/")
library(gbifdb)
library(dplyr)
con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/")
gbif <- tbl(con, "gbif")
pinsapo <- gbif %>%
filter(species == "Abies pinsapo",
countrycode == "ES",
year > 2010) %>%
select(species, year, decimallatitude, decimallongitude)
pinsapo
#> # Source: lazy query [?? x 4]
#> # Database: duckdb_connection
#> species year decimallatitude decimallongitude
#> <chr> <int> <dbl> <dbl>
#> 1 Abies pinsapo 2013 36.7 -4.96
#> 2 Abies pinsapo 2013 36.8 -5.39
#> 3 Abies pinsapo 2021 36.7 -5.04
#> 4 Abies pinsapo 2018 40.5 -4.1
#> 5 Abies pinsapo 2020 40.9 -4
#> 6 Abies pinsapo 2019 41.7 -3.7
#> 7 Abies pinsapo 2021 36.7 -5.04
#> 8 Abies pinsapo 2021 36.7 -5.04
#> 9 Abies pinsapo 2019 40.5 -3.3
#> 10 Abies pinsapo 2019 40.7 -4.7
#> # … with more rows
However, there seems to be a problem when two special columns (`mediatype` and `issue`) are selected for return in the results table, causing the R session to crash (see https://github.com/cboettig/gbifdb/issues/2). I hope this will be easy to fix; otherwise it should be flagged in the package documentation.
Minor comments
@Pakillo Thanks for this amazing review (and the help tracking down the issues in https://github.com/cboettig/gbifdb/issues/2)! Still working through it, but wanted to jot down what I've tried to address so far (see diff in PR#3).
I've updated the default bucket to us-east-1 instead of South Africa (a biased default to be sure, but also what I believe most AWS clients default to when nothing is set). I've tried to note in the docs that users probably want to set the bucket based on their location.
Not entirely sure what was up with `gbif_download()`; I'll test more too, but I've made a few tweaks. Like you say, setting a closer region helps, and I've made sure to unset alternative credentials more aggressively. My understanding of `aws.s3::s3sync()` behavior is that it checks md5sums and so should not re-download, but I haven't tested to confirm that. Now that `arrow` is a dependency, I'm considering using its S3 interface to do the download instead, but it is much more low-level. It's quite possible that the `awscli` tool (as shown at https://registry.opendata.aws/gbif/) or other such tools give significantly better performance. Maybe I should point users at that and deprecate this function...
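If useful, the arrow-based route I'm considering would look roughly like this (a sketch only; the bucket name, snapshot path, and `anonymous` flag here are assumptions, not actual gbifdb code):

```r
library(arrow)

# Connect to the public GBIF bucket without credentials and copy the parquet
# snapshot to a local directory.
bucket <- s3_bucket("gbif-open-data-us-east-1", anonymous = TRUE)
copy_files(
  from = bucket$path("occurrence/2021-11-01/occurrence.parquet"),
  to   = "gbif-data/occurrence.parquet"
)
```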
I think I've addressed some of the performance issues you identified in `gbif_remote()` now. It may still be slower than rgbif for small queries (again depending on network speed), but I will need to benchmark more.
more to come, gotta run
Date accepted: 2022-03-22
Submitting Author Name: Carl Boettiger
Submitting Author Github Handle: @cboettig
Repository: https://github.com/cboettig/gbifdb
Version submitted: 0.1.0
Submission type: Standard
Editor: @jhollist
Reviewers: @Pakillo, @stephhazlitt
Due date for @Pakillo: 2022-01-10
Due date for @stephhazlitt: 2022-01-10
Archive: TBD
Version accepted: TBD
Language: en
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
This package is similar in scope to `rgbif`, but allows direct access to the full occurrence table using `dplyr` functions, with or without downloading.

Ecologists etc. using / analyzing GBIF data. In particular, anyone attempting analyses which need the whole GBIF data corpus, like the example shown in the README, cannot easily do so with the standard API / rgbif approach.
As noted above, this is similar in scope to rgbif, but more general and faster.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
[x] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct