ropensci / software-review

rOpenSci Software Peer Review.

gbifdb: Local Database Interface to 'GBIF' #492

Closed: cboettig closed this issue 2 years ago

cboettig commented 2 years ago

Date accepted: 2022-03-22
Submitting Author Name: Carl Boettiger
Submitting Author Github Handle: @cboettig
Repository: https://github.com/cboettig/gbifdb
Version submitted: 0.1.0
Submission type: Standard
Editor: @jhollist
Reviewers: @Pakillo, @stephhazlitt

Due date for @Pakillo: 2022-01-10
Due date for @stephhazlitt: 2022-01-10

Archive: TBD
Version accepted: TBD
Language: en


Package: gbifdb
Version: 0.1.0
Title: Local Database Interface to 'GBIF'
Description: A high performance interface to the Global Biodiversity
  Information Facility, 'GBIF'.  In contrast to 'rgbif', which can
  access small subsets of 'GBIF' data through web-based queries to
  a central server, 'gbifdb' provides enhanced performance for R users
  performing large-scale analyses on servers and cloud computing
  providers, providing full support for arbitrary 'SQL' or 'dplyr'
  operations on the complete 'GBIF' data tables (now over 1 billion
  records and over a terabyte in size). 'gbifdb' accesses a copy
  of the 'GBIF' data in 'parquet' format, which is already readily
  available in commercial computing clouds such as the Amazon Open
  Data portal and the Microsoft Planetary Computer, or can be 
  accessed directly or downloaded
  to any server with suitable bandwidth and storage space.
Authors@R: c(
            person("Carl", "Boettiger", , "cboettig@gmail.com", c("aut", "cre"),
                   comment = c(ORCID = "0000-0002-1642-628X"))
            )
License: MIT + file LICENSE
Encoding: UTF-8
ByteCompile: true
Depends: R (>= 4.0)
Imports:
    duckdb (>= 0.2.9),
    DBI
Suggests:
    spelling,
    dplyr,
    dbplyr,
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown,
    aws.s3,
    arrow
URL: https://github.com/cboettig/gbifdb
BugReports: https://github.com/cboettig/gbifdb
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

This package is similar in scope to rgbif, but allows direct access to the full occurrence table using dplyr functions with or without downloading.

Ecologists and others using or analyzing GBIF data. In particular, anyone attempting analyses that require the whole GBIF data corpus, like the example shown in the README, cannot easily do so with the standard API / rgbif approach.

As noted above, this is similar in scope to rgbif, but more general and faster.
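To make that concrete, a whole-corpus query of this kind looks roughly as follows (a sketch; the filter columns and values are illustrative, and the README shows a real example):

library(gbifdb)
library(dplyr)

gbif <- gbif_remote()  # lazy table over the complete parquet snapshot

gbif %>%
  filter(phylum == "Chordata", year > 1990) %>%
  count(class, year) %>%
  collect()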

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

ropensci-review-bot commented 2 years ago

Missing values: author1, repourl, submission-type

maelle commented 2 years ago

I fixed the submission (HTML comments were missing, they're needed for the bot @cboettig :slightly_smiling_face:).

ldecicco-USGS commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 46b2cb4f

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 4 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 4| 23.3| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 52| 3.1|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 24| 12.8| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 12.6| |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 31| 11.0| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 38| 67.4| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |46b2cb |2021-11-30 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 53.12

The following files are not completely covered by tests:

file | coverage
--- | ---
R/gbif_download.R | 0%
R/gbif_remote.R | 60%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 1 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 1


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.52 |
|pkgcheck |0.0.2.185 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

cboettig commented 2 years ago

Thanks! I think most of these issues are resolved in the repo now (e.g. by commit https://github.com/cboettig/gbifdb/commit/46b2cb4fb0b180a767c8f091d00533d024d3ea3c or prior).

Coverage is still less than 75% because I do not test gbif_download(). That function downloads over 100 GB of data via an AWS S3 sync. It is a thin wrapper around aws.s3::s3_sync(), and I don't think anything is gained by a mocking test, but I'm happy to be convinced otherwise.

There's still no CITATION file, since there's no publication to point to, but there's now a codemeta.json file :see_no_evil:.

Please let me know if there's anything else I ought to address first!

ldecicco-USGS commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 015b74e7

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 55| 3.4|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 35| 17.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 0.0|TRUE |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 18| 6.8| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 34| 69.2| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |failure |015b74 |2021-12-02 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following error:

1. checking tests ...
   Running ‘spelling.R’
   Comparing ‘spelling.Rout’ to ‘spelling.Rout.save’ ... OK
   Running ‘testthat.R’ ERROR
   Running the tests in ‘tests/testthat.R’ failed.
   Last 13 lines of output:
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
   ── Error (test_gbifdb.R:39:3): gbif_remote(to_duckdb=TRUE)....slow! ────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = TRUE) test_gbifdb.R:39:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
   [ FAIL 2 | WARN 1 | SKIP 0 | PASS 4 ]
   Error: Test failures
   Execution halted

R CMD check generated the following test_fail:

1. > library(testthat)
   > library(gbifdb)
   >
   > test_check("gbifdb")
   See arrow_info() for available features

   Attaching package: 'arrow'
   The following object is masked from 'package:testthat': matches
   The following object is masked from 'package:utils': timestamp

   Attaching package: 'dplyr'
   The following object is masked from 'package:testthat': matches
   The following objects are masked from 'package:stats': filter, lag
   The following objects are masked from 'package:base': intersect, setdiff, setequal, union

   ══ Warnings ════════════════════════════════════════════════════════════════════
   ── Warning (test_gbifdb.R:1:1): (code run outside of `test_that()`) ────────────
   `context()` was deprecated in the 3rd edition.
   Backtrace:
     1. testthat::context("test gbifdb") test_gbifdb.R:1:0
     2. testthat:::edition_deprecate(3, "context()")

   ══ Failed tests ════════════════════════════════════════════════════════════════
   ── Error (test_gbifdb.R:30:3): gbif_remote() ───────────────────────────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = FALSE) test_gbifdb.R:30:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
   ── Error (test_gbifdb.R:39:3): gbif_remote(to_duckdb=TRUE)....slow! ────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = TRUE) test_gbifdb.R:39:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)

   [ FAIL 2 | WARN 1 | SKIP 0 | PASS 4 ]
   Error: Test failures
   Execution halted

R CMD check generated the following check_fail:

1. rcmdcheck_tests_pass

#### Test coverage with [covr](https://covr.r-lib.org/)

ERROR: Test Coverage Failed

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.54 |
|pkgcheck |0.0.2.187 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

mpadge commented 2 years ago

@cboettig This is the first submission with an arrow dependency, for which the standard installation of system libraries is inadequate. I'll rebuild the check system with a pre-install of arrow, and your package will at least pass R CMD check. Thanks for the good test case!

ldecicco-USGS commented 2 years ago

@ropensci-review-bot assign @jhollist as editor

ropensci-review-bot commented 2 years ago

Assigned! @jhollist is now the editor

cboettig commented 2 years ago

Thanks @mpadge -- it's good to discover these errors now! I forgot that arrow could optionally be built without S3 support. I've added checks so the tests will skip if arrow::arrow_info() reports that arrow was built without S3 capability. (No doubt CRAN has at least one machine where that is also the case. Realistically I will need to skip that check on CRAN too, I think; the operation is too network-intensive to be robust, and again I don't think there's much value in creating a mock test that passes even when the network is unavailable.)
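For reference, the guard looks roughly like this (a sketch of the pattern, not the package's exact helper; the `capabilities` element name is my reading of `arrow_info()`'s return value):

# skip tests on builds of arrow that lack S3 support (sketch)
skip_if_no_arrow_s3 <- function() {
  testthat::skip_if_not_installed("arrow")
  if (!isTRUE(arrow::arrow_info()$capabilities[["s3"]])) {
    testthat::skip("arrow was compiled without S3 support")
  }
}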

jhollist commented 2 years ago

Hi, @cboettig! Looking forward to working on this with you! I am full up today, but I wanted to let you know that I have this on my radar and will start working on it early next week.

jhollist commented 2 years ago

@mpadge Have you had a chance to add arrow to the check system? I want to hold off running the package checks until I know that is ready to roll.

mpadge commented 2 years ago

Yep @jhollist it's up and rolling with arrow pre-compiled for S3 support. You may check away ...

jhollist commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 2c58f8e7

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 56| 3.5|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 43| 20.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 0.0|TRUE |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 20| 7.5| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 44| 72.6| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |2c58f8 |2021-12-04 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 66.67

The following files are not completely covered by tests:

file | coverage
--- | ---
R/gbif_download.R | 0%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.54 |
|pkgcheck |0.0.2.195 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

cboettig commented 2 years ago

@jhollist Thanks much!

I've added a contributing.md (usethis::use_tidy_contributing()), apologies for missing that one.

As mentioned above, I don't think it makes sense to add CITATION at this stage: R will already generate the best available option via citation(), CiteAs will recognize the codemeta.json (and I think the DESCRIPTION), and Zenodo will, I think, continue to ignore all of these and go by the GitHub metadata or its own .zenodo.json override instead. I love metadata, but I'm not convinced data entry is improved by so much duplication at this stage... If this package ever has an external doc or paper to cite, I'll certainly add that to CITATION at that time...

Also, I'm not sure I can do much about code coverage without resorting to silly hacks. (I totally appreciate the spirit of the rule, but I'm never fond of putting too much emphasis on statistics that are relatively easy to game, which only further erodes their utility...)

I think we're clear on the other checks? Cool auto-check system! Let me know what's next!

jhollist commented 2 years ago

@cboettig A couple of items:

  1. Go ahead and still add the CITATION file (see: https://devguide.ropensci.org/building.html?q=citatio#citation-file). Having this in there will allow you to track any Zenodo DOIs that get minted (although not necessary right now). Also, it might be useful to provide a citation for the GBIF data/AWS service.

  2. I agree with you on the testing. I think you have a fairly complete test suite. I will probably ask the reviewers to think about this one as well, just to make sure we aren't missing something. That being said, nothing to do right now. I am happy to send on to reviewers with the current test suite.

  3. I can't get the dev version of duckdb installed, but it looks like the version on CRAN is newer than what you require in the DESCRIPTION.

  4. When I run gbif_remote() I get

    Error: IOError: When getting information for key 'occurrence/2021-11-01/occurrence.parquet' in bucket 'gbif-open-data-af-south-1': AWS Error [code 15]: No response body.

    I tried with different version and different bucket but did not work. Is this my machine (session info below) or do you see the same thing?

So the next steps are: add the CITATION, double-check whether users need the dev version of duckdb or whether CRAN is fine, and make sure gbif_remote() is working. After that I have some other local checks to do, then I can pass it on to reviewers.

Make sense?

My Session Info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.7  gbifdb_0.1.0

loaded via a namespace (and not attached):
 [1] arrow_6.0.1      fansi_0.5.0      assertthat_0.2.1 utf8_1.2.2       crayon_1.4.2     R6_2.5.1        
 [7] DBI_1.1.1        lifecycle_1.0.1  magrittr_2.0.1   pillar_1.6.4     rlang_0.4.12     remotes_2.4.1   
[13] vctrs_0.3.8      generics_0.1.1   ellipsis_0.3.2   tools_4.1.0      bit64_4.0.5      glue_1.5.1      
[19] bit_4.0.4        purrr_0.3.4      compiler_4.1.0   pkgconfig_2.0.3  tidyselect_1.1.1 tibble_3.1.6 
cboettig commented 2 years ago

Thanks @jhollist! That makes perfect sense. I've added a CITATION now, based on the GBIF citation guidelines for the AWS bucket.
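For anyone following along, a CITATION file is just an R script that builds a bibentry(); something along these lines (a sketch only -- the real entry follows GBIF's recommended wording, and the URL here is a plausible pointer rather than necessarily the one used):

# sketch of a CITATION entry; actual wording/URL follow GBIF's guidelines
citHeader("To cite the GBIF occurrence snapshot accessed via gbifdb:")
bibentry(
  bibtype = "Misc",
  title   = "GBIF Species Occurrences on the Registry of Open Data on AWS",
  author  = person("GBIF.org"),
  year    = "2021",
  url     = "https://registry.opendata.aws/gbif/"
)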

Hmm... I can't replicate the error you see. Note that gbif_remote() is included in the test suite, and appears to be passing on those OSs (https://github.com/cboettig/gbifdb/actions/runs/1546475635) as well as in the automated checks above... :thinking:. I've also tested locally on my dinky Windows tablet with 4 GB of RAM on my crappy home network, and it works there as well...

It looks to me like a network error though -- packets lost in transmission or something. Have you confirmed it's repeatable? I've tried to get others to test this remote arrow connection; I think there were a few cases where someone had trouble but then it worked? I haven't been able to find a reproducible error on any system I can access or get someone to test... How do we proceed?

I would probably recommend users fall back on the local download for high-throughput use, or in cases where network bandwidth is too slow for interactive remote use, which is why the package provides both options.

jhollist commented 2 years ago

I can confirm that it is me, not you! Something with my work machine. I checked your actions after I sent my response and saw that they worked... I was able to get it to run on my son's laptop! I will sequester his machine and do the additional checks. I'll try to get through that tomorrow.

I'll also dig in a bit more on my machine and see if I can get it working. Stay tuned!

jhollist commented 2 years ago

@cboettig I figured it out. I had an existing aws credentials file that was likely expired. I think authentication was failing and that was causing problems. Removed my old credentials file and tried again. Everything working! Will finish with checks on this tomorrow.

cboettig commented 2 years ago

@jhollist brilliant detective work! :detective: I had forgotten about credentials! Actually, I'd seen that same problem before with the aws.s3 download -- which fails even with valid credentials, I believe, because this is a public bucket. You can see I override them like so: https://github.com/cboettig/gbifdb/blob/main/R/gbif_download.R#L33

(I'm not sure that's really the best option; credentials in the ~/.aws dir could also get in the way, I think.) Anyway, this would be a great issue to hash out with reviewer input! Meanwhile I can at least document the issue in the function docs.
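(For reference, the override boils down to blanking any ambient AWS credentials so that requests to the public bucket go out unsigned -- a sketch, not necessarily the exact code at the line linked above:)

# sketch: clear ambient AWS credentials so the public bucket is
# accessed anonymously (the package's actual code may differ)
Sys.setenv(AWS_ACCESS_KEY_ID = "",
           AWS_SECRET_ACCESS_KEY = "",
           AWS_SESSION_TOKEN = "")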

jhollist commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 3cf5429c

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 10 lines of code)
- 7 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 68| 4.5|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 43| 20.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 12| 9.0| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 7| 7.8| |
|n_fns_per_file_r | 1| 11.5| |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 43.9| |
|loc_per_fn_r_exp | 10| 23.4| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 25| 11.8| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 44| 72.6| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 4| 9.7| |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |3cf542 |2021-12-07 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 70.73

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.56 |
|pkgcheck |0.0.2.195 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

jhollist commented 2 years ago

Editor checks:

Editor comments

I will start working on finding reviewers. Looking forward to working on this!


jhollist commented 2 years ago

@ropensci-review-bot assign @Pakillo as reviewer

ropensci-review-bot commented 2 years ago

@Pakillo added to the reviewers list. Review due date is 2021-12-29. Thanks @Pakillo for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@Pakillo: If you haven't done so, please fill this form for us to update our reviewers records.

jhollist commented 2 years ago

@ropensci-review-bot assign @stephhazlitt as reviewer

ropensci-review-bot commented 2 years ago

@stephhazlitt added to the reviewers list. Review due date is 2022-01-03. Thanks @stephhazlitt for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@stephhazlitt: If you haven't done so, please fill this form for us to update our reviewers records.

jhollist commented 2 years ago

@cboettig Given the time of year, I was planning on giving the reviewers a bit more time to complete the reviews. Would you be opposed to me setting the review due dates to Jan 10?

cboettig commented 2 years ago

@jhollist certainly! :snowman_with_snow:

jhollist commented 2 years ago

Thanks @cboettig!

@stephhazlitt and @Pakillo, note the extension on the due date. I set them for 2022-01-10.

jhollist commented 2 years ago

@stephhazlitt and @Pakillo just a quick follow up to see how the reviews are coming along. Just holler if you need a bit more time.

stephhazlitt commented 2 years ago

{gbifdb} Package Review

Thank you for the opportunity to test drive and review this great new R package. {gbifdb} brings together many things near and dear to me -- easier access to lots and lots of open biodiversity data 🦋🦉🦡, R, and leveraging new open source tools like Arrow 🏹 and DuckDB 🦆 to make all of this faster and possible on a "regular" laptop (for my review I used a MacBook Pro M1 2020). I hope this feedback is helpful, please feel free to reach out if there are follow up questions or if there is anything I can do to support moving forward.

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 5ish


Remote Data Access

library(gbifdb)
library(dplyr)
library(tictoc)

gbif_s3 <- gbif_remote()

tic()
df <- gbif_s3 |>
  filter(species == "Haematopus bachmani",
         year > 2018,
         stateprovince == "British Columbia") |> 
  count(year) |> 
  collect() 
toc()

## Error: IOError: AWS Error [code 99]: curlCode: 28, Timeout was reached
## 6098.66 sec elapsed

Local Data Access

gbif_download(dir = "/Users/stephhazlitt/gbif-data/")

gbif_dir <- "/Users/stephhazlitt/gbif-data/occurrence.parquet/"
conn <- gbif_conn(gbif_dir)

gbif <- tbl(conn, "gbif")
gbif

library(gbifdb)
library(dplyr)
library(arrow)
library(ggplot2)

gbif_parquet <- gbif_example_data()

## database 
conn_example <- gbif_conn(gbif_parquet)
gbif_example <- tbl(conn_example, "gbif")

gbif_example |> 
  group_by(order, phylum) |> 
  summarise(n_species = n_distinct(species)) |> 
  collect() |> 
  ggplot() +
  geom_col(aes(n_species, order, fill = phylum))

## arrow
open_dataset(gbif_parquet) |> 
  group_by(order, phylum) |> 
  summarise(n_species = n_distinct(species)) |> 
  collect() |> 
  ggplot() +
  geom_col(aes(n_species, order, fill = phylum))

Minor Comments

cboettig commented 2 years ago

Thanks @stephhazlitt, this is super :tada:! A couple of design questions you bring up that I'd really like advice on:

I completely hear you that it's awkward that gbif_conn() returns a database connection while gbif_remote() returns a remote table; i.e. with the former we need to remember to use dplyr::tbl(), but with the arrow-based interface we don't. This rather begs for a gbif_local(), which would just be tbl(gbif_conn(), "gbif"). So Q1: is it worth making dplyr a hard dependency in order to add gbif_local()?

Currently, dplyr is only a soft dependency, since presumably some applications or users will just want the DBI connection object and are happy querying in SQL (some users/apps know SQL well and may not want dplyr doing it for them). I think of dplyr as a heavy dependency, but it's also a common one, so maybe I should just bite that bullet?
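For instance, a SQL-first user can skip dplyr entirely and query the view directly (a quick sketch against the bundled example data):

library(gbifdb)

# gbif_conn() returns a plain DBI connection exposing a view named 'gbif'
con <- gbif_conn(gbif_example_data())
DBI::dbGetQuery(con,
  "SELECT phylum, COUNT(DISTINCT species) AS n_species
   FROM gbif
   GROUP BY phylum")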

Q2: I'm also obviously on the fence as to whether gbif_remote() should just go ahead and call to_duckdb() or not. Doing so would make the interface a bit more consistent between the two connection types. OTOH, to_duckdb() can actually slow things down a bit at present when doing queries that arrow already knows how to do, which as you noted is quite a lot in arrow 6.0.1. Thoughts on either of these?
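For concreteness, the two flavors in question (both are lazy connections; only the engine planning the query differs):

# arrow-native remote table: arrow plans and executes the query itself
gbif_arrow <- gbif_remote(to_duckdb = FALSE)

# routed through arrow::to_duckdb(): duckdb plans the query over arrow's
# data, enabling nearly full dbplyr support (e.g. windowed operations)
gbif_duck <- gbif_remote(to_duckdb = TRUE)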

I totally hear you on the timing and network issues. I'll add some comments in the README and function docs. One interesting case to mention might be increasingly popular cloud-hosted computing services, like https://rstudio.cloud. Similarly, I teach my courses using RStudio instances Berkeley provides that are similarly dinky in RAM and CPU, but because they are hosted on Google Cloud they have access to good bandwidth. I think AWS EC2 instances in the same region as the GBIF bucket would see particularly impressive performance, comparable to the local download, but almost all cloud providers have rather fast network performance, even on instances with less RAM or CPU power than a dinky laptop.

Benchmarking network issues is tricky though, or at least I've never been properly trained how to do it! Maybe some folks from the arrow team could advise me on that. With colleagues over at GBIF, we put together this draft blog post very recently: https://github.com/gbif/data-blog/blob/master/content/post/2022-01-04-make-big-queries-using-GBIF-parquet-snapshots.md. My local laptop is a crappy Windows Go tablet that struggles to run R, but I log in remotely to an RStudio session on my desktop at work, where we have gigabit ethernet, so I see runtimes that are almost an order of magnitude faster than John reports for remote connections, and fewer connection issues.

Also, good point about the laptop falling asleep during the whole-data download -- I can add notes to the docs about that too. It's about a 100 GB download. Google tells me the average download speed in the US is over 100 MB/s, but that sounds ridiculous; say you get 10 MB/s: 100 GB / 10 MB/s = 10,000 s, so the download ought to take roughly 2.8 hours.

It may also be worth noting that while 100 GB isn't all that large for a laptop, it's much more storage than most typical cloud providers will provision, so gbif_remote() may be the only option there too.

stephhazlitt commented 2 years ago

Great questions.

I am biased towards a fully-R workflow, so I'm not sure my thoughts are representative of your user community; however, I would go with the {dplyr} dependency and aim for a symmetrical two-functions-per-GBIF-data-location design:

gbif_remote() - as is, an arrow table connection
gbif_remote_db() - wrapper for the above with to_duckdb = TRUE

gbif_local() - wrapper for arrow::open_dataset() to read locally stored parquet data
gbif_local_db() - for tbl(gbif_conn(), "gbif")
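In code, the proposal is roughly this (a sketch; signatures are hypothetical, composed from the functions the package already has):

# hypothetical wrappers composed from existing pieces
gbif_remote_db <- function(...) gbif_remote(..., to_duckdb = TRUE)

gbif_local <- function(dir) arrow::open_dataset(dir)

gbif_local_db <- function(dir) dplyr::tbl(gbif_conn(dir), "gbif")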

My thinking is that this brings maximum flexibility for users and the package, given tools and dependencies are all changing and improving so quickly.

If you really wanted to provide a lightweight version for non-{dplyr} users, perhaps you could make a one-function package for gbif_conn() and then use that package as a dependency in {gbifdb}? I am channeling learnings from the {bcdata} experience, where, if we had a redo, it would for sure be more than one package.

cboettig commented 2 years ago

Nice, I like it! Just hashing out function names a bit more here:

I think going from two functions (gbif_remote() and gbif_local()) to four seems a bit overkill. I imagine the end user doesn't care so much whether it's duckdb vs arrow on the backend, particularly when functions like to_duckdb() let an arrow connection use duckdb or vice versa. In particular, I'm not sure what we achieve by having the two different flavors gbif_local() / gbif_local_db(). Sure, the duckdb-based one supports nearly the full dbplyr functionality while arrow doesn't do windowed operations (at least not yet), but that seems a more minor difference. (Technically, duckdb can access remote parquet files directly without arrow, which would mean we could drop the arrow dependency and gbif_remote() would match gbif_local() for all use cases without needing to_duckdb() for windowed operations... but so far that's not enabled by default in the R client: https://github.com/duckdb/duckdb/issues/2177)

Re: audience, I'm with you on a fully-R workflow; I just meant that there's probably a subset of old-school R users out there who have been using the DBI package since before dplyr/dbplyr and are happy to keep doing so. But I can keep gbif_conn() as a function for them and still make dplyr a hard dependency, I think. (I'm not sure, but this might mean dbplyr should be a dependency too?) The two-package idea is a good one, but I think it's overkill here, since this is already a thin wrapper around duckdb and arrow.

Pakillo commented 2 years ago

Hi, apologies for the delay. Still playing with it. Hope to submit my review within a couple of days

Pakillo commented 2 years ago

{gbifdb} Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 12


Review Comments

Thanks a lot for the invitation to review this package. I hesitated to accept the invitation at first. First, I doubt I can ever improve anything @cboettig has written. Second, although I have some modest experience using databases with dplyr, I have never used arrow or duckdb. But, I have learnt a lot reviewing gbifdb and, as a frequent GBIF user, my own work will benefit directly from this exercise (and the package itself, of course). I do hope my review is useful, at least from the perspective of an intermediate R user keen to start using the package.

README

The README/vignette is already very informative and nice. Some suggestions for improving it:

gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2000) %>% 
  count(year) %>%
  collect()

# A tibble: 13 × 2
    year     n
   <int> <int>
 1  2017     4
 2  2020    44
 3  2018    24
 4  2019    61
 5  2016     6
 6  2021     7
 7  2006     5
 8  2004     1
 9  2015     3
10  2014     2
11  2013     2
12  2009     1
13  2010     1

Downloading the database

gbif_download looks really handy, but I struggled a bit to download the entire database, even using a good (fiber) internet connection.

This is the code I used to download:

gbif_download(dir = "/media/frs/Elements/gbifdb/", 
  region = "eu-central-1", 
  bucket = "gbif-open-data-eu-central-1")

My first attempt failed within 20 minutes:

# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/000129' to '/media/frs/Elements/gbifdb///occurrence.parquet/000129'
# Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
#   Timeout was reached: [s3-eu-central-1.amazonaws.com] Resolving timed out after 10000 milliseconds

A second attempt gave an error after a while:

# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/001044' to '/media/frs/Elements/gbifdb///occurrence.parquet/001044'
# 131 local files not found in bucket 'gbif-open-data-eu-central-1'
# List of 4
# $ Code     : chr "AccessDenied"
# $ Message  : chr "Access Denied"
# $ RequestId: chr "CA2X7EK3FCXH7RP2"
# $ HostId   : chr "9h1TicaLFACsutywE7S7etD/yuYckCyxHKbnDfL7zd7hJv2PTCqPxkV8m7gwtxKCtWPCnQEPca0="
# - attr(*, "headers")=List of 7
# ..$ x-amz-request-id : chr "CA2X7EK3FCXH7RP2"
# ..$ x-amz-id-2       : chr "9h1TicaLFACsutywE7S7etD/yuYckCyxHKbnDfL7zd7hJv2PTCqPxkV8m7gwtxKCtWPCnQEPca0="
# ..$ content-type     : chr "application/xml"
# ..$ transfer-encoding: chr "chunked"
# ..$ date             : chr "Thu, 13 Jan 2022 11:22:45 GMT"
# ..$ server           : chr "AmazonS3"
# ..$ connection       : chr "close"
# ..- attr(*, "class")= chr [1:2] "insensitive" "list"
# - attr(*, "class")= chr "aws_error"
# NULL
# Error in parse_aws_s3_response(r, Sig, verbose = verbose) : 
#   Forbidden (HTTP 403).

A third attempt gave a similar error, even though it looks like all the files had been successfully downloaded (1045 parquet files plus the citation file):

# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/001044' to '/media/frs/Elements/gbifdb///occurrence.parquet/001044'
# 1046 local files not found in bucket 'gbif-open-data-eu-central-1'
# List of 4
#  $ Code     : chr "AccessDenied"
#  $ Message  : chr "Access Denied"
#  $ RequestId: chr "21VETY51D6FQ6K3A"
#  $ HostId   : chr "couGo2j17BhB5a8k3GFjI8gIrO3akAbRzCaikOOk30sEIr7d4nCwBLcZRdA0wlYAwGF969q2KZ4="
#  - attr(*, "headers")=List of 7
#   ..$ x-amz-request-id : chr "21VETY51D6FQ6K3A"
#   ..$ x-amz-id-2       : chr "couGo2j17BhB5a8k3GFjI8gIrO3akAbRzCaikOOk30sEIr7d4nCwBLcZRdA0wlYAwGF969q2KZ4="
#   ..$ content-type     : chr "application/xml"
#   ..$ transfer-encoding: chr "chunked"
#   ..$ date             : chr "Thu, 13 Jan 2022 13:55:31 GMT"
#   ..$ server           : chr "AmazonS3"
#   ..$ connection       : chr "close"
#   ..- attr(*, "class")= chr [1:2] "insensitive" "list"
#  - attr(*, "class")= chr "aws_error"
# NULL
# Error in parse_aws_s3_response(r, Sig, verbose = verbose) : 
#   Forbidden (HTTP 403).

In general (asking from ignorance), is there any way to make the download easier and less fragile for the user? Also, it seems the download starts from scratch every time you call gbif_download (that is, files already on disk are downloaded and replaced again, even when requesting the same GBIF snapshot)?

Querying the local database

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/")
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query CREATE VIEW 'gbif' AS SELECT * FROM parquet_scan('/media/frs/Elements/gbifdb//*');
Error: Invalid Input Error: No magic bytes found at end of file '/media/frs/Elements/gbifdb/citation.txt'

You have to add "occurrence.parquet" for it to work:

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/")

library(gbifdb)
library(dplyr)

gbif <- tbl(con, "gbif")

pinsapo <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2010) %>% 
  select(species, year, decimallatitude, decimallongitude)

pinsapo
#> # Source:   lazy query [?? x 4]
#> # Database: duckdb_connection
#>    species        year decimallatitude decimallongitude
#>    <chr>         <int>           <dbl>            <dbl>
#>  1 Abies pinsapo  2013            36.7            -4.96
#>  2 Abies pinsapo  2013            36.8            -5.39
#>  3 Abies pinsapo  2021            36.7            -5.04
#>  4 Abies pinsapo  2018            40.5            -4.1 
#>  5 Abies pinsapo  2020            40.9            -4   
#>  6 Abies pinsapo  2019            41.7            -3.7 
#>  7 Abies pinsapo  2021            36.7            -5.04
#>  8 Abies pinsapo  2021            36.7            -5.04
#>  9 Abies pinsapo  2019            40.5            -3.3 
#> 10 Abies pinsapo  2019            40.7            -4.7 
#> # … with more rows

However, there seems to be a problem when two particular columns (mediatype and issue) are selected for return in the results table, causing the R session to crash (see https://github.com/cboettig/gbifdb/issues/2). I hope this will be easy to fix; otherwise it should be flagged in the package documentation.
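(Until that is fixed, I suppose a workaround might be to drop those columns before collecting -- untested:)

pinsapo <- gbif %>%
  filter(species == "Abies pinsapo") %>%
  select(-mediatype, -issue) %>%   # avoid the columns that trigger the crash
  collect()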

Minor comments

cboettig commented 2 years ago

@Pakillo Thanks for this amazing review (and the help tracking down the issues in https://github.com/cboettig/gbifdb/issues/2)! I'm still working through it, but wanted to jot down what I've tried to address so far (see the diff in PR#3).

I think I've addressed some of the performance issues you identified in gbif_remote() now. It may still be slower than rgbif for small queries (again, depending on network speed), but I will need to benchmark more.

more to come, gotta run