ropensci / software-review

rOpenSci Software Peer Review.

gbifdb: Local Database Interface to 'GBIF' #492

Closed: cboettig closed this issue 2 years ago

cboettig commented 2 years ago

Date accepted: 2022-03-22
Submitting Author Name: Carl Boettiger
Submitting Author Github Handle: @cboettig
Repository: https://github.com/cboettig/gbifdb
Version submitted: 0.1.0
Submission type: Standard
Editor: @jhollist
Reviewers: @Pakillo, @stephhazlitt

Due date for @Pakillo: 2022-01-10
Due date for @stephhazlitt: 2022-01-10

Archive: TBD
Version accepted: TBD
Language: en


Package: gbifdb
Version: 0.1.0
Title: Local Database Interface to 'GBIF'
Description: A high performance interface to the Global Biodiversity
  Information Facility, 'GBIF'.  In contrast to 'rgbif', which can
  access small subsets of 'GBIF' data through web-based queries to
  a central server, 'gbifdb' provides enhanced performance for R users
  performing large-scale analyses on servers and cloud computing
  providers, providing full support for arbitrary 'SQL' or 'dplyr'
  operations on the complete 'GBIF' data tables (now over 1 billion
  records and over a terabyte in size). 'gbifdb' accesses a copy
  of the 'GBIF' data in 'parquet' format, which is already readily
  available in commercial computing clouds such as the Amazon Open
  Data portal and the Microsoft Planetary Computer, or can be 
  accessed directly or downloaded
  to any server with suitable bandwidth and storage space.
Authors@R: c(
            person("Carl", "Boettiger", , "cboettig@gmail.com", c("aut", "cre"),
                   comment = c(ORCID = "0000-0002-1642-628X"))
            )
License: MIT + file LICENSE
Encoding: UTF-8
ByteCompile: true
Depends: R (>= 4.0)
Imports:
    duckdb (>= 0.2.9),
    DBI
Suggests:
    spelling,
    dplyr,
    dbplyr,
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown,
    aws.s3,
    arrow
URL: https://github.com/cboettig/gbifdb
BugReports: https://github.com/cboettig/gbifdb
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

This package is similar in scope to rgbif, but allows direct access to the full occurrence table using dplyr functions with or without downloading.

Ecologists and others using or analyzing GBIF data. In particular, anyone attempting analyses that require the whole GBIF data corpus, like the example shown in the README, cannot easily do so with the standard API / rgbif approach.

As noted above, this is similar in scope to rgbif, but more general and faster.
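To make that concrete, a whole-corpus query of this kind looks roughly as follows (a sketch; the filter columns and values are illustrative, and the README shows a real example):

library(gbifdb)
library(dplyr)

gbif <- gbif_remote()  # lazy table over the complete parquet snapshot

gbif %>%
  filter(phylum == "Chordata", year > 1990) %>%
  count(class, year) %>%
  collect()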

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

ropensci-review-bot commented 2 years ago

Missing values: author1, repourl, submission-type

maelle commented 2 years ago

I fixed the submission (HTML comments were missing, they're needed for the bot @cboettig :slightly_smiling_face:).

ldecicco-USGS commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 46b2cb4f

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 4 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 4| 23.3| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 52| 3.1|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 24| 12.8| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 12.6| |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 31| 11.0| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 38| 67.4| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |46b2cb |2021-11-30 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 53.12

The following files are not completely covered by tests:

file | coverage
--- | ---
R/gbif_download.R | 0%
R/gbif_remote.R | 60%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 1 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 1


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.52 |
|pkgcheck |0.0.2.185 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

cboettig commented 2 years ago

Thanks! I think most of these issues are resolved in the repo now (e.g. by commit https://github.com/cboettig/gbifdb/commit/46b2cb4fb0b180a767c8f091d00533d024d3ea3c or prior).

Coverage is still less than 75% because I do not test gbif_download(). That function downloads over 100 GB of data via an AWS S3 sync. It is a thin wrapper around aws.s3::s3_sync(), and I don't think anything is gained by a mocking test, but I'm happy to be convinced otherwise.

There's still no CITATION file, since there's no publication to point to, but there's now a codemeta.json file :see_no_evil:.

Please let me know if there's anything else I ought to address first!

ldecicco-USGS commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 015b74e7

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 55| 3.4|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 35| 17.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 0.0|TRUE |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 18| 6.8| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 34| 69.2| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |failure |015b74 |2021-12-02 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following error:

1. checking tests ...
   Running ‘spelling.R’
   Comparing ‘spelling.Rout’ to ‘spelling.Rout.save’ ... OK
   Running ‘testthat.R’ ERROR
   Running the tests in ‘tests/testthat.R’ failed.
   Last 13 lines of output:
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
   ── Error (test_gbifdb.R:39:3): gbif_remote(to_duckdb=TRUE)....slow! ────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = TRUE) test_gbifdb.R:39:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
   [ FAIL 2 | WARN 1 | SKIP 0 | PASS 4 ]
   Error: Test failures
   Execution halted

R CMD check generated the following test_fail:

1. > library(testthat)
   > library(gbifdb)
   >
   > test_check("gbifdb")
   See arrow_info() for available features

   Attaching package: 'arrow'
   The following object is masked from 'package:testthat': matches
   The following object is masked from 'package:utils': timestamp

   Attaching package: 'dplyr'
   The following object is masked from 'package:testthat': matches
   The following objects are masked from 'package:stats': filter, lag
   The following objects are masked from 'package:base': intersect, setdiff, setequal, union

   ══ Warnings ════════════════════════════════════════════════════════════════════
   ── Warning (test_gbifdb.R:1:1): (code run outside of `test_that()`) ────────────
   `context()` was deprecated in the 3rd edition.
   Backtrace:
     1. testthat::context("test gbifdb") test_gbifdb.R:1:0
     2. testthat:::edition_deprecate(3, "context()")

   ══ Failed tests ════════════════════════════════════════════════════════════════
   ── Error (test_gbifdb.R:30:3): gbif_remote() ───────────────────────────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = FALSE) test_gbifdb.R:30:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)
   ── Error (test_gbifdb.R:39:3): gbif_remote(to_duckdb=TRUE)....slow! ────────────
   Error: NotImplemented: Got S3 URI but Arrow compiled without S3 support
   Backtrace:
     █
     1. └─gbifdb::gbif_remote(to_duckdb = TRUE) test_gbifdb.R:39:2
     2. └─arrow::s3_bucket(bucket, ...)
     3. └─FileSystem$from_uri(bucket)
     4. └─arrow:::fs___FileSystemFromUri(uri)

   [ FAIL 2 | WARN 1 | SKIP 0 | PASS 4 ]
   Error: Test failures
   Execution halted

R CMD check generated the following check_fail:

1. rcmdcheck_tests_pass

#### Test coverage with [covr](https://covr.r-lib.org/)

ERROR: Test Coverage Failed

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.54 |
|pkgcheck |0.0.2.187 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

mpadge commented 2 years ago

@cboettig This is the first submission with an arrow dependency, for which the standard installation of system libraries is inadequate. I'll rebuild the check system with a pre-install of arrow, and your package will at least pass R CMD check. Thanks for the good test case!

ldecicco-USGS commented 2 years ago

@ropensci-review-bot assign @jhollist as editor

ropensci-review-bot commented 2 years ago

Assigned! @jhollist is now the editor

cboettig commented 2 years ago

Thanks @mpadge -- it's good to discover these errors now! I forgot that arrow could optionally be built without S3 support. I've added checks so the tests will skip if arrow::arrow_info() reports that arrow was built without S3 capability. (No doubt CRAN has at least one machine where that is also the case. Realistically I will need to skip that check on CRAN too, I think; the operation is too network-intensive to be robust, and again I don't think there's much value in creating a mock test that passes even when the network is unavailable.)
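For reference, the guard looks roughly like this (a sketch of the pattern, not the package's exact helper; the `capabilities` element name is my reading of `arrow_info()`'s return value):

# skip tests on builds of arrow that lack S3 support (sketch)
skip_if_no_arrow_s3 <- function() {
  testthat::skip_if_not_installed("arrow")
  if (!isTRUE(arrow::arrow_info()$capabilities[["s3"]])) {
    testthat::skip("arrow was compiled without S3 support")
  }
}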

jhollist commented 2 years ago

Hi, @cboettig! Looking forward to working on this with you! I am full up today, but I wanted to let you know that I have this on my radar and will start working on it early next week.

jhollist commented 2 years ago

@mpadge Have you had a chance to add arrow to the check system? I want to hold off running the package checks until I know that is ready to roll.

mpadge commented 2 years ago

Yep @jhollist it's up and rolling with arrow pre-compiled for S3 support. You may check away ...

jhollist commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 2c58f8e7

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 9 lines of code)
- 5 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 56| 3.5|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 43| 20.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 10| 6.8| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 5| 4.0|TRUE |
|n_fns_per_file_r | 1| 0.0|TRUE |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 39.1| |
|loc_per_fn_r_exp | 9| 20.3| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 20| 7.5| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 44| 72.6| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 2| 3.4|TRUE |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |2c58f8 |2021-12-04 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 66.67

The following files are not completely covered by tests:

file | coverage
--- | ---
R/gbif_download.R | 0%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.54 |
|pkgcheck |0.0.2.195 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

cboettig commented 2 years ago

@jhollist Thanks much!

I've added a contributing.md (usethis::use_tidy_contributing()), apologies for missing that one.

As mentioned above, I don't think it makes sense to add CITATION at this stage: R will already generate the best available option via citation(), CiteAs will recognize the codemeta.json (and I think the DESCRIPTION), and Zenodo will, I think, continue to ignore all of these and go by the GitHub metadata or its own .zenodo.json override instead. I love metadata, but I'm not convinced data entry is improved by so much duplication at this stage... If this package ever has an external doc or paper to cite, I'll certainly add that to CITATION at that time...

Also, I'm not sure I can do much about code coverage without resorting to silly hacks. (I totally appreciate the spirit of the rule, but I'm never fond of putting too much emphasis on statistics that are relatively easy to game, which only further erodes their utility...)

I think we're clear on the other checks? Cool auto-check system! Let me know what's next!

jhollist commented 2 years ago

@cboettig A couple of items:

  1. Go ahead and still add the CITATION file (see: https://devguide.ropensci.org/building.html?q=citatio#citation-file). Having this in there will allow you to track any Zenodo DOIs that get minted (although not necessary right now). Also, it might be useful to provide a citation for the GBIF data/AWS service.

  2. I agree with you on the testing. I think you have a fairly complete test suite. I will probably ask the reviewers to think about this one as well, just to make sure we aren't missing something. That being said, nothing to do right now. I am happy to send on to reviewers with the current test suite.

  3. I can't get the dev version of duckdb installed, but it looks like the version on CRAN is newer than what you require in the DESCRIPTION.

  4. When I run gbif_remote() I get

    Error: IOError: When getting information for key 'occurrence/2021-11-01/occurrence.parquet' in bucket 'gbif-open-data-af-south-1': AWS Error [code 15]: No response body.

    I tried with different version and different bucket but did not work. Is this my machine (session info below) or do you see the same thing?

So the next steps are: add the CITATION, double-check whether users need the dev version of duckdb or whether CRAN is fine, and make sure gbif_remote() is working. After that I have some other local checks to do, then I can pass it on to reviewers.

Make sense?

My Session Info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.7  gbifdb_0.1.0

loaded via a namespace (and not attached):
 [1] arrow_6.0.1      fansi_0.5.0      assertthat_0.2.1 utf8_1.2.2       crayon_1.4.2     R6_2.5.1        
 [7] DBI_1.1.1        lifecycle_1.0.1  magrittr_2.0.1   pillar_1.6.4     rlang_0.4.12     remotes_2.4.1   
[13] vctrs_0.3.8      generics_0.1.1   ellipsis_0.3.2   tools_4.1.0      bit64_4.0.5      glue_1.5.1      
[19] bit_4.0.4        purrr_0.3.4      compiler_4.1.0   pkgconfig_2.0.3  tidyselect_1.1.1 tibble_3.1.6 
cboettig commented 2 years ago

Thanks @jhollist! That makes perfect sense. I've added a CITATION now, based on the GBIF citation guidelines for the AWS bucket.
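For anyone following along, a CITATION file is just an R script that builds a bibentry(); something along these lines (a sketch only -- the real entry follows GBIF's recommended wording, and the URL here is a plausible pointer rather than necessarily the one used):

# sketch of a CITATION entry; actual wording/URL follow GBIF's guidelines
citHeader("To cite the GBIF occurrence snapshot accessed via gbifdb:")
bibentry(
  bibtype = "Misc",
  title   = "GBIF Species Occurrences on the Registry of Open Data on AWS",
  author  = person("GBIF.org"),
  year    = "2021",
  url     = "https://registry.opendata.aws/gbif/"
)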

Hmm... I can't replicate the error you see. Note that gbif_remote() is included in the test suite, and appears to be passing on those OSs (https://github.com/cboettig/gbifdb/actions/runs/1546475635) as well as in the automated checks above... :thinking:. I've also tested locally on my dinky Windows tablet with 4 GB of RAM on my crappy home network, and it works there as well...

It looks to me like a network error though -- packets lost in transmission or something. Have you confirmed it's repeatable? I've tried to get others to test this remote arrow connection; I think there were a few cases where someone had trouble but then it worked? I haven't been able to find a reproducible error on any system I can access or get someone to test... How do we proceed?

I would probably recommend users fall back on the local download for high-throughput use, or in cases where network bandwidth is too slow for interactive remote use, which is why the package provides both options.

jhollist commented 2 years ago

I can confirm that it is me, not you! Something with my work machine. I checked your actions after I sent my response and saw that they worked... I was able to get it to run on my son's laptop! I will sequester his machine and do the additional checks. I'll try to get through that tomorrow.

I'll also dig in a bit more on my machine and see if I can get it working. Stay tuned!

jhollist commented 2 years ago

@cboettig I figured it out. I had an existing aws credentials file that was likely expired. I think authentication was failing and that was causing problems. Removed my old credentials file and tried again. Everything working! Will finish with checks on this tomorrow.

cboettig commented 2 years ago

@jhollist brilliant detective work! :detective: I had forgotten about credentials! Actually, I'd seen that same problem before with the aws.s3 download -- which fails even with valid credentials, I believe, because this is a public bucket. You can see I override them like so: https://github.com/cboettig/gbifdb/blob/main/R/gbif_download.R#L33

(I'm not sure that's really the best option; credentials in the ~/.aws dir could also get in the way, I think.) Anyway, this would be a great issue to hash out with reviewer input! Meanwhile I can at least document the issue in the function docs.
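(For reference, the override boils down to blanking any ambient AWS credentials so that requests to the public bucket go out unsigned -- a sketch, not necessarily the exact code at the line linked above:)

# sketch: clear ambient AWS credentials so the public bucket is
# accessed anonymously (the package's actual code may differ)
Sys.setenv(AWS_ACCESS_KEY_ID = "",
           AWS_SECRET_ACCESS_KEY = "",
           AWS_SESSION_TOKEN = "")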

jhollist commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for gbifdb (v0.1.0)

git hash: 3cf5429c

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 5 files)
- 1 authors
- 1 vignette
- no internal data file
- 3 imported packages
- 5 exported functions (median 10 lines of code)
- 7 non-exported functions in R (median 12 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 5| 29.8| |
|files_vignettes | 1| 64.8| |
|files_tests | 3| 71.1| |
|loc_R | 68| 4.5|TRUE |
|loc_vignettes | 106| 54.0| |
|loc_tests | 43| 20.1| |
|num_vignettes | 1| 60.7| |
|n_fns_r | 12| 9.0| |
|n_fns_r_exported | 5| 21.0| |
|n_fns_r_not_exported | 7| 7.8| |
|n_fns_per_file_r | 1| 11.5| |
|num_params_per_fn | 2| 10.7| |
|loc_per_fn_r | 10| 43.9| |
|loc_per_fn_r_exp | 10| 23.4| |
|loc_per_fn_r_not_exp | 12| 55.9| |
|rel_whitespace_R | 25| 11.8| |
|rel_whitespace_vignettes | 52| 78.8| |
|rel_whitespace_tests | 44| 72.6| |
|doclines_per_fn_exp | 15| 5.7| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 4| 9.7| |

---

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges

![R-CMD-check](https://github.com/cboettig/gbifdb/workflows/R-CMD-check/badge.svg)

**GitHub Workflow Results**

|name |conclusion |sha |date |
|:-----------|:----------|:------|:----------|
|R-CMD-check |success |3cf542 |2021-12-07 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 70.73

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

No functions have cyclocomplexity >= 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found no issues with this package!


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.0.3.56 |
|pkgcheck |0.0.2.195 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

jhollist commented 2 years ago

Editor checks:

Editor comments

I will start working on finding reviewers. Looking forward to working on this!


jhollist commented 2 years ago

@ropensci-review-bot assign @Pakillo as reviewer

ropensci-review-bot commented 2 years ago

@Pakillo added to the reviewers list. Review due date is 2021-12-29. Thanks @Pakillo for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@Pakillo: If you haven't done so, please fill this form for us to update our reviewers records.

jhollist commented 2 years ago

@ropensci-review-bot assign @stephhazlitt as reviewer

ropensci-review-bot commented 2 years ago

@stephhazlitt added to the reviewers list. Review due date is 2022-01-03. Thanks @stephhazlitt for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@stephhazlitt: If you haven't done so, please fill this form for us to update our reviewers records.

jhollist commented 2 years ago

@cboettig Given the time of year, I was planning on giving the reviewers a bit more time to complete the reviews. Would you be opposed to me setting the review due dates to Jan 10?

cboettig commented 2 years ago

@jhollist certainly! :snowman_with_snow:

jhollist commented 2 years ago

Thanks @cboettig!

@stephhazlitt and @Pakillo, note the extension on the due date. I set them for 2022-01-10.

jhollist commented 2 years ago

@stephhazlitt and @Pakillo just a quick follow up to see how the reviews are coming along. Just holler if you need a bit more time.

stephhazlitt commented 2 years ago

{gbifdb} Package Review

Thank you for the opportunity to test drive and review this great new R package. {gbifdb} brings together many things near and dear to me -- easier access to lots and lots of open biodiversity data 🦋🦉🦡, R, and leveraging new open source tools like Arrow 🏹 and DuckDB 🦆 to make all of this faster and possible on a "regular" laptop (for my review I used a MacBook Pro M1 2020). I hope this feedback is helpful, please feel free to reach out if there are follow up questions or if there is anything I can do to support moving forward.

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 5ish


Remote Data Access

library(gbifdb)
library(dplyr)
library(tictoc)

gbif_s3 <- gbif_remote()

tic()
df <- gbif_s3 |>
  filter(species == "Haematopus bachmani",
         year > 2018,
         stateprovince == "British Columbia") |> 
  count(year) |> 
  collect() 
toc()

## Error: IOError: AWS Error [code 99]: curlCode: 28, Timeout was reached
## 6098.66 sec elapsed

Local Data Access

gbif_download(dir = "/Users/stephhazlitt/gbif-data/")

gbif_dir <- "/Users/stephhazlitt/gbif-data/occurrence.parquet/"
conn <- gbif_conn(gbif_dir)

gbif <- tbl(conn, "gbif")
gbif

library(gbifdb)
library(dplyr)
library(arrow)
library(ggplot2)

gbif_parquet <- gbif_example_data()

## database 
conn_example <- gbif_conn(gbif_parquet)
gbif_example <- tbl(conn_example, "gbif")

gbif_example |> 
  group_by(order, phylum) |> 
  summarise(n_species = n_distinct(species)) |> 
  collect() |> 
  ggplot() +
  geom_col(aes(n_species, order, fill = phylum))

## arrow
open_dataset(gbif_parquet) |> 
  group_by(order, phylum) |> 
  summarise(n_species = n_distinct(species)) |> 
  collect() |> 
  ggplot() +
  geom_col(aes(n_species, order, fill = phylum))

Minor Comments

cboettig commented 2 years ago

Thanks @stephhazlitt, this is super :tada:! A couple of design questions you bring up that I'd really like advice on:

I completely hear you that it's awkward that gbif_conn() returns a database connection while gbif_remote() returns a remote table; i.e. with the former we need to remember to use dplyr::tbl(), but with the arrow-based interface we don't. This rather begs for a gbif_local(), which would just be tbl(gbif_conn(), "gbif"). So Q1: is it worth making dplyr a hard dependency in order to add gbif_local()?

Currently, dplyr is only a soft dependency, since presumably some applications or users will just want the DBI connection object and are happy querying in SQL (some users/apps know SQL well and may not want dplyr doing it for them). I think of dplyr as a heavy dependency, but it's also a common one, so maybe I should just bite that bullet?
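For instance, a SQL-first user can skip dplyr entirely and query the view directly (a quick sketch against the bundled example data):

library(gbifdb)

# gbif_conn() returns a plain DBI connection exposing a view named 'gbif'
con <- gbif_conn(gbif_example_data())
DBI::dbGetQuery(con,
  "SELECT phylum, COUNT(DISTINCT species) AS n_species
   FROM gbif
   GROUP BY phylum")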

Q2: I'm also obviously on the fence as to whether gbif_remote() should just go ahead and call to_duckdb() or not. Doing so would make the interface a bit more consistent between the two connection types. OTOH, to_duckdb() can actually slow things down a bit at present when doing queries that arrow already knows how to do, which as you noted is quite a lot in arrow 6.0.1. Thoughts on either of these?
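For concreteness, the two flavors in question (both are lazy connections; only the engine planning the query differs):

# arrow-native remote table: arrow plans and executes the query itself
gbif_arrow <- gbif_remote(to_duckdb = FALSE)

# routed through arrow::to_duckdb(): duckdb plans the query over arrow's
# data, enabling nearly full dbplyr support (e.g. windowed operations)
gbif_duck <- gbif_remote(to_duckdb = TRUE)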

I totally hear you on the timing and network issues. I'll add some comments in the README and function docs. One interesting case to mention might be increasingly popular cloud-hosted computing services, like https://rstudio.cloud. Similarly, I teach my courses using RStudio instances Berkeley provides that are similarly dinky in RAM and CPU, but because they are hosted on Google Cloud they have access to good bandwidth. I think AWS EC2 instances in the same region as the GBIF bucket would see particularly impressive performance, comparable to the local download, but almost all cloud providers have rather fast network performance, even on instances with less RAM or CPU power than a dinky laptop.

Benchmarking network issues is tricky though, or at least I've never been properly trained how to do it! Maybe some folks from the arrow team could advise me on that. With colleagues over at GBIF, we put together this draft blog post very recently: https://github.com/gbif/data-blog/blob/master/content/post/2022-01-04-make-big-queries-using-GBIF-parquet-snapshots.md. My local laptop is a crappy Windows Go tablet that struggles to run R, but I log in remotely to an RStudio session on my desktop at work, where we have gigabit ethernet, so I see runtimes that are almost an order of magnitude faster than John reports for remote connections, and fewer connection issues.

Also, good point about the laptop falling asleep during the whole-data download -- I can add notes to the docs about that too. It's about a 100 GB download. Google tells me the average download speed in the US is over 100 MB/s, but that sounds ridiculous; say you get 10 MB/s: 100 GB / 10 MB/s = 10,000 s, so the download ought to take roughly 2.8 hours.

It may also be worth noting that while 100 GB isn't all that large for a laptop, it's much more storage than most typical cloud providers will provision, so gbif_remote() may be the only option there too.

stephhazlitt commented 2 years ago

Great questions.

I am biased towards a fully-R workflow, so I'm not sure my thoughts are representative of your user community; however, I would go with the {dplyr} dependency and aim for a symmetrical two-functions-per-GBIF-data-location design:

gbif_remote() - as is, an arrow table connection
gbif_remote_db() - wrapper for the above with to_duckdb = TRUE

gbif_local() - wrapper for arrow::open_dataset() to read locally stored parquet data
gbif_local_db() - for tbl(gbif_conn(), "gbif")
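In code, the proposal is roughly this (a sketch; signatures are hypothetical, composed from the functions the package already has):

# hypothetical wrappers composed from existing pieces
gbif_remote_db <- function(...) gbif_remote(..., to_duckdb = TRUE)

gbif_local <- function(dir) arrow::open_dataset(dir)

gbif_local_db <- function(dir) dplyr::tbl(gbif_conn(dir), "gbif")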

My thinking is that this brings maximum flexibility for users and the package, given tools and dependencies are all changing and improving so quickly.

If you really wanted to provide a lightweight version for non-{dplyr} users, perhaps you could make a one-function package for gbif_conn() and then use that package as a dependency in {gbifdb}? I am channeling learnings from the {bcdata} experience, where, if we had a redo, it would for sure be more than one package.

cboettig commented 2 years ago

Nice, I like it! Just hashing out function names a bit more here:

I think going from two functions (gbif_remote() and gbif_local()) to four seems a bit overkill. I imagine the end user doesn't care so much whether it's duckdb vs arrow on the backend, particularly when functions like to_duckdb() let an arrow connection use duckdb or vice versa. In particular, I'm not sure what we achieve by having the two different flavors gbif_local() / gbif_local_db(). Sure, the duckdb-based one supports nearly the full dbplyr functionality while arrow doesn't do windowed operations (at least not yet), but that seems a more minor difference. (Technically, duckdb can access remote parquet files directly without arrow, which would mean we could drop the arrow dependency and gbif_remote() would match gbif_local() for all use cases without needing to_duckdb() for windowed operations... but so far that's not enabled by default in the R client: https://github.com/duckdb/duckdb/issues/2177)

Re: audience, I'm with you on a fully-R workflow; I just meant that there's probably a subset of old-school R users out there who have been using the DBI package since before dplyr/dbplyr and are happy to keep doing so. But I can keep gbif_conn() as a function for them and still make dplyr a hard dependency, I think. (I'm not sure, but this might mean dbplyr should be a dependency too?) The two-package idea is a good one, but I think it's overkill here, since this is already a thin wrapper around duckdb and arrow.

Pakillo commented 2 years ago

Hi, apologies for the delay. Still playing with it. Hope to submit my review within a couple of days

Pakillo commented 2 years ago

{gbifdb} Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 12


Review Comments

Thanks a lot for the invitation to review this package. I hesitated to accept the invitation at first. First, I doubt I can ever improve anything @cboettig has written. Second, although I have some modest experience using databases with dplyr, I have never used arrow or duckdb. But, I have learnt a lot reviewing gbifdb and, as a frequent GBIF user, my own work will benefit directly from this exercise (and the package itself, of course). I do hope my review is useful, at least from the perspective of an intermediate R user keen to start using the package.

README

The README/vignette is already very informative and nice. Some suggestions for improving it:

gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2000) %>% 
  count(year) %>%
  collect()

# A tibble: 13 × 2
    year     n
   <int> <int>
 1  2017     4
 2  2020    44
 3  2018    24
 4  2019    61
 5  2016     6
 6  2021     7
 7  2006     5
 8  2004     1
 9  2015     3
10  2014     2
11  2013     2
12  2009     1
13  2010     1

Downloading the database

gbif_download looks really handy, but I struggled a bit to download the entire database, even using a good (fiber) internet connection.

This is the code I used to download:

gbif_download(dir = "/media/frs/Elements/gbifdb/", 
  region = "eu-central-1", 
  bucket = "gbif-open-data-eu-central-1")

My first attempt failed within 20 minutes:

# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/000129' to '/media/frs/Elements/gbifdb///occurrence.parquet/000129'
# Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
#   Timeout was reached: [s3-eu-central-1.amazonaws.com] Resolving timed out after 10000 milliseconds

A second attempt gave an error after a while:

# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/001044' to '/media/frs/Elements/gbifdb///occurrence.parquet/001044'
# 131 local files not found in bucket 'gbif-open-data-eu-central-1'
# List of 4
# $ Code     : chr "AccessDenied"
# $ Message  : chr "Access Denied"
# $ RequestId: chr "CA2X7EK3FCXH7RP2"
# $ HostId   : chr "9h1TicaLFACsutywE7S7etD/yuYckCyxHKbnDfL7zd7hJv2PTCqPxkV8m7gwtxKCtWPCnQEPca0="
# - attr(*, "headers")=List of 7
# ..$ x-amz-request-id : chr "CA2X7EK3FCXH7RP2"
# ..$ x-amz-id-2       : chr "9h1TicaLFACsutywE7S7etD/yuYckCyxHKbnDfL7zd7hJv2PTCqPxkV8m7gwtxKCtWPCnQEPca0="
# ..$ content-type     : chr "application/xml"
# ..$ transfer-encoding: chr "chunked"
# ..$ date             : chr "Thu, 13 Jan 2022 11:22:45 GMT"
# ..$ server           : chr "AmazonS3"
# ..$ connection       : chr "close"
# ..- attr(*, "class")= chr [1:2] "insensitive" "list"
# - attr(*, "class")= chr "aws_error"
# NULL
# Error in parse_aws_s3_response(r, Sig, verbose = verbose) : 
#   Forbidden (HTTP 403).

A third attempt gave a similar error, even though it looks like all the files had been successfully downloaded (1045 parquet files plus the citation file):

# <== Saving object 'occurrence/2021-11-01/occurrence.parquet/001044' to '/media/frs/Elements/gbifdb///occurrence.parquet/001044'
# 1046 local files not found in bucket 'gbif-open-data-eu-central-1'
# List of 4
#  $ Code     : chr "AccessDenied"
#  $ Message  : chr "Access Denied"
#  $ RequestId: chr "21VETY51D6FQ6K3A"
#  $ HostId   : chr "couGo2j17BhB5a8k3GFjI8gIrO3akAbRzCaikOOk30sEIr7d4nCwBLcZRdA0wlYAwGF969q2KZ4="
#  - attr(*, "headers")=List of 7
#   ..$ x-amz-request-id : chr "21VETY51D6FQ6K3A"
#   ..$ x-amz-id-2       : chr "couGo2j17BhB5a8k3GFjI8gIrO3akAbRzCaikOOk30sEIr7d4nCwBLcZRdA0wlYAwGF969q2KZ4="
#   ..$ content-type     : chr "application/xml"
#   ..$ transfer-encoding: chr "chunked"
#   ..$ date             : chr "Thu, 13 Jan 2022 13:55:31 GMT"
#   ..$ server           : chr "AmazonS3"
#   ..$ connection       : chr "close"
#   ..- attr(*, "class")= chr [1:2] "insensitive" "list"
#  - attr(*, "class")= chr "aws_error"
# NULL
# Error in parse_aws_s3_response(r, Sig, verbose = verbose) : 
#   Forbidden (HTTP 403).

In general (asking from ignorance), is there any way to make the download easier and less fragile for the user? Also, it seems the download starts from scratch every time you call gbif_download (that is, files already on disk are downloaded and replaced again, even when requesting the same GBIF snapshot)?

Querying the local database

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/")
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query CREATE VIEW 'gbif' AS SELECT * FROM parquet_scan('/media/frs/Elements/gbifdb//*');
Error: Invalid Input Error: No magic bytes found at end of file '/media/frs/Elements/gbifdb/citation.txt'

You have to add "occurrence.parquet" for it to work:

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/")

library(gbifdb)
library(dplyr)

gbif <- tbl(con, "gbif")

pinsapo <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2010) %>% 
  select(species, year, decimallatitude, decimallongitude)

pinsapo
#> # Source:   lazy query [?? x 4]
#> # Database: duckdb_connection
#>    species        year decimallatitude decimallongitude
#>    <chr>         <int>           <dbl>            <dbl>
#>  1 Abies pinsapo  2013            36.7            -4.96
#>  2 Abies pinsapo  2013            36.8            -5.39
#>  3 Abies pinsapo  2021            36.7            -5.04
#>  4 Abies pinsapo  2018            40.5            -4.1 
#>  5 Abies pinsapo  2020            40.9            -4   
#>  6 Abies pinsapo  2019            41.7            -3.7 
#>  7 Abies pinsapo  2021            36.7            -5.04
#>  8 Abies pinsapo  2021            36.7            -5.04
#>  9 Abies pinsapo  2019            40.5            -3.3 
#> 10 Abies pinsapo  2019            40.7            -4.7 
#> # … with more rows

However, there seems to be a problem when two particular columns (mediatype and issue) are selected for return in the results table, causing the R session to crash (see https://github.com/cboettig/gbifdb/issues/2). I hope this will be easy to fix; otherwise it should be flagged in the package documentation.
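(Until that is fixed, I suppose a workaround might be to drop those columns before collecting -- untested:)

pinsapo <- gbif %>%
  filter(species == "Abies pinsapo") %>%
  select(-mediatype, -issue) %>%   # avoid the columns that trigger the crash
  collect()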

Minor comments

cboettig commented 2 years ago

@Pakillo Thanks for this amazing review (and the help tracking down the issues in https://github.com/cboettig/gbifdb/issues/2)! I'm still working through it, but wanted to jot down what I've tried to address so far (see the diff in PR#3).

I think I've addressed some of the performance issues you identified in gbif_remote() now. It may still be slower than rgbif for small queries (again, depending on network speed), but I will need to benchmark more.

more to come, gotta run