cboettig commented 2 years ago

Date accepted: 2022-03-22 Submitting Author Name: Carl Boettiger Submitting Author Github Handle: !--author1-->@cboettig@jhollist<!--end-editor-- Reviewers: @Pakillo, @stephhazlitt

Due date for @Pakillo: 2022-01-10 Due date for @stephhazlitt: 2022-01-10

Archive: TBD Version accepted: TBD Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: gbifdb
Version: 0.1.0
Title: Local Database Interface to 'GBIF'
Description: A high performance interface to the Global Biodiversity
  Information Facility, 'GBIF'.  In contrast to 'rgbif', which can
  access small subsets of 'GBIF' data through web-based queries to
  a central server, 'gbifdb' provides enhanced performance for R users
  performing large-scale analyses on servers and cloud computing
  providers, providing full support for arbitrary 'SQL' or 'dplyr'
  operations on the complete 'GBIF' data tables (now over 1 billion
  records, and over a terrabyte in size.) 'gbifdb' accesses a copy
  of the 'GBIF' data in 'parquet' format, which is already readily
  available in commercial computing clouds such as the Amazon Open
  Data portal and the Microsoft Planetary Computer, or can be 
  accessed directly or downloaded
  to any server with suitable bandwidth and storage space.
Authors@R: c(
            person("Carl", "Boettiger", , "cboettig@gmail.com", c("aut", "cre"),
                   comment = c(ORCID = "0000-0002-1642-628X"))
            )
License: MIT + file LICENSE
Encoding: UTF-8
ByteCompile: true
Depends: R (>= 4.0)
Imports:
    duckdb (>= 0.2.9),
    DBI
Suggests:
    spelling,
    dplyr,
    dbplyr,
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown,
    aws.s3,
    arrow
URL: https://github.com/cboettig/gbifdb
BugReports: https://github.com/cboettig/gbifdb
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [x] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package is similar in scope to rgbif, but allows direct access to the full occurrence table using dplyr functions with or without downloading.

Who is the target audience and what are scientific applications of this package?

Ecologists etc using / analyzing GBIF data. In particular, anyone attempting analyses which need the whole GBIF data corpus, like the example shown in the README, cannot easily do so with the standard API / rgbif approach.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

As noted above, this is similar in scope to rgbif, but more general and faster.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[x] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

jhollist commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1010488630 time 5

ropensci-review-bot commented 2 years ago

Logged review for stephhazlitt (hours: 5)

jhollist commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1012556350 time 12

ropensci-review-bot commented 2 years ago

Couldn't find entry for pakillo in the reviews log

maelle commented 2 years ago

@jhollist this might be a small buffy bug, thanks for your patience :-) https://github.com/ropensci-org/buffy/issues/54

maelle commented 2 years ago

@jhollist could you please try again? thank you!

jhollist commented 2 years ago

@ropensci-review-bot submit review #492 (comment) time 12

ropensci-review-bot commented 2 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@ropensci-review-bot help

jhollist commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1012556350 time 12

ropensci-review-bot commented 2 years ago

Logged review for Pakillo (hours: 12)

jhollist commented 2 years ago

Sorry for delay in moving things along, @maelle! Looks like I had a bug too! Works now.

jhollist commented 2 years ago

@cboettig Now that I finally got caught up, I have added the "Review In, Awaiting Changes" label. Sorry for slowing things down.

maelle commented 2 years ago

@jhollist Thanks a ton, actually it's a happy mistake as it's the first time this new "sorry human" response happens in this repo :grin:

cboettig commented 2 years ago

Thanks all! I'm working on revisions still as noted above, will include a complete report & links to my PR diffs when I've ticked off everything.

cboettig commented 2 years ago

Ok, I think I've hit everything now, beween https://github.com/cboettig/gbifdb/pull/3 and https://github.com/cboettig/gbifdb/pull/4 . Summaries on each PR, and maybe the diffs are helpful too.

jhollist commented 2 years ago

@Pakillo and @stephhazlitt just wanted to make sure you saw these. @cboettig has two PR's linked in https://github.com/ropensci/software-review/issues/492#issuecomment-1027489887 that should address the issues raised in your reviews. Could you please take a look at these and let us know if everything is good to go?

jhollist commented 2 years ago

@Pakillo and @stephhazlitt pinging you both again to make sure that your review comments have been addressed.

stephhazlitt commented 2 years ago

Thanks for the (double) ping @jhollist. I have pulled the peer-review/patch2 branch as I think it has all the review-related changes. I really like the new gbif_local() convenience function @cboettig. I am not sure that the top-level package documentation made it in, so I will put a plug in for that so that ?gbifdb works for users. Great work on the package! 🦋 🐝 🐒

Pakillo commented 2 years ago

Hi,

Thanks @cboettig for working out all those changes. The documentation and usage look much nicer.

Unfortunately I'm still having issues trying to make the package work...

Remote connection

First, if I try the remote connection (gbif_remote) using duckdb (the default), my R session invariable crashes whenever I try a consult, as it goes out of RAM (16 GB in my laptop).

So, I can connect successfully to the remote database: gbif <- gbif_remote(bucket = "gbif-open-data-eu-central-1", to_duckdb = TRUE)

but when I try a consult, like the one in the README or others, my R session crashes:

gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)

The remote connection does work well if I choose an arrow connection. Although consulting the database keeps being a bit slow?

library(gbifdb)
library(dplyr)  # optional, for dplyr-based operations

gbif <- gbif_remote(bucket = "gbif-open-data-eu-central-1",
                    to_duckdb = FALSE)

system.time({
  ex <- gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
})
#>    user  system elapsed 
#>   0.055   0.000   0.056

system.time({  
  ex %>% 
    collect()
})
#>    user  system elapsed 
#> 310.606  28.141 126.655

system.time({
df <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2020) %>% 
  # count(year) %>%  
  select(species, year, locality, decimallatitude, decimallongitude)
})
#>    user  system elapsed 
#>   0.033   0.003   0.036

system.time({
df %>% 
  collect() 
})
#>    user  system elapsed 
#> 328.496 115.479 426.916

system.time({
df <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2000) %>% 
  count(year) %>%
  collect()
})
#>    user  system elapsed 
#> 271.148  43.273 220.665

^{Created on 2022-02-12 by the reprex package (v2.0.1)}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.2 (2021-11-01) #> os Ubuntu 20.04.3 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_GB:en #> collate en_GB.UTF-8 #> ctype en_GB.UTF-8 #> tz Europe/Madrid #> date 2022-02-12 #> pandoc 2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> arrow 7.0.0 2022-02-10 [1] RSPM (R 4.1.0) #> assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0) #> bit 4.0.4 2020-08-04 [1] RSPM (R 4.1.0) #> bit64 4.0.5 2020-08-30 [1] RSPM (R 4.1.0) #> cli 3.1.1 2022-01-20 [1] RSPM (R 4.1.0) #> crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1) #> DBI 1.1.2 2021-12-20 [1] RSPM (R 4.1.0) #> digest 0.6.29 2021-12-01 [1] RSPM (R 4.1.0) #> dplyr * 1.0.8 2022-02-08 [1] RSPM (R 4.1.0) #> ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0) #> evaluate 0.14 2019-05-28 [1] RSPM (R 4.1.0) #> fansi 1.0.2 2022-01-14 [1] RSPM (R 4.1.0) #> fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0) #> fs 1.5.2 2021-12-08 [1] RSPM (R 4.1.0) #> gbifdb * 0.1.1 2022-02-12 [1] Github (cboettig/gbifdb@7f425f5) #> generics 0.1.2 2022-01-31 [1] RSPM (R 4.1.0) #> glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.2) #> highr 0.9 2021-04-16 [1] RSPM (R 4.1.0) #> htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.1.0) #> knitr 1.37 2021-12-16 [1] RSPM (R 4.1.0) #> lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.1.0) #> magrittr 2.0.2 2022-01-26 [1] RSPM (R 4.1.0) #> pillar 1.7.0 2022-02-01 [1] RSPM (R 4.1.0) #> pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0) #> purrr 0.3.4 2020-04-17 [1] RSPM (R 4.1.0) #> R6 2.5.1 2021-08-19 [1] RSPM (R 4.1.0) #> reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0) #> rlang 1.0.1 2022-02-03 [1] RSPM (R 4.1.0) #> rmarkdown 2.11 2021-09-14 [1] RSPM (R 4.1.0) #> rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0) #> sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.1.0) #> stringi 1.7.6 2021-11-29 [1] RSPM (R 4.1.0) #> stringr 1.4.0 2019-02-10 [1] RSPM (R 4.1.0) #> tibble 3.1.6 2021-11-07 [1] RSPM (R 4.1.0) #> tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0) #> utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0) #> vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.1.0) #> withr 2.4.3 2021-11-30 [1] RSPM (R 4.1.0) #> xfun 0.29 2021-12-14 [1] RSPM (R 4.1.0) #> yaml 2.2.2 2022-01-25 [1] CRAN (R 4.1.2) #> #> [1] /home/frs/R/x86_64-pc-linux-gnu-library/4.0 #> [2] /opt/R/4.1.2/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

Local database

Querying the local database does work well, and quite fast, but only when using the duckdb backend:

library(gbifdb)
library(dplyr)  

gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                   backend = "duckdb")

system.time({
  ex <- gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})
#>    user  system elapsed 
#>   0.015   0.000   0.015

If I choose the arrow backend, my R session always crashes (again due to RAM)

Sorry for the bad news. Please let me know if I'm doing something wrong. Otherwise I hope it's not too complicated to fix.

Cheers

P.S. The pkgdown website and the README on github are out of date as README.Rmd has not been knitted.

cboettig commented 2 years ago

Thanks @Pakillo ! Yup, looks like there are still upstream issues in arrow and duckdb being worked out -- e.g. arrow 7.0 R client has a bug in count that causes it to read into memory. I have a handful of open tickets in arrow and they have a great and very responsive team of developers, so I'm hopeful this will improve soon. See:

bizarrely, I don't see duckdb version in your sessionInfo. I have an outstanding issue on duckdb regarding RAM performance on to_duckdb() as well https://github.com/duckdb/duckdb/issues/2768, so I think we may need to wait for that one to resolve upstream as well.

It sounds like we have reasonable functionality as long as a user sticks with arrow for remote connections and duckdb for local connections then? (Ironically I've chosen the opposite setup as the default both times!) I'm still optimistic things will improve with arrow and duckdb as the upstream issues are closed.

[ ] I'll update pkgdown and README.md
[ ] I'll add the package-level doc-string (thanks @stephhazlitt , I always forget to do that one)

stephhazlitt commented 2 years ago

@Pakillo I am curious what happens when you access the data directly with arrow::open_dataset()? Probably the same crash if it is RAM-related, however I am finding that this approach is generally faster on the example dataset and subset of the GBIF data I downloaded.

library(gbifdb)
library(dplyr)

system.time({
  ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})

Pakillo commented 2 years ago

Thanks @cboettig. I reckon these issues stem from upstream dependencies (arrow and duckdb) - hope they are fixed soon.

bizarrely, I don't see duckdb version in your sessionInfo

That's because that in that session I am choosing directly an arrow backend (to_duckdb = FALSE), so it seems duckdb does not get loaded.

@stephhazlitt Thanks for the suggestion! Using arrow::open_dataset, as you suggest, does not crash my R session (while using gbif_local with backend = "arrow" does crash it). Why could that be @cboettig?

library(gbifdb)
library(dplyr)

system.time({
  ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})
#>    user  system elapsed 
#>   0.199   0.017   0.252

cboettig commented 2 years ago

Why could that be @cboettig?

gbif_local() using the arrow backend calls arrow::open_dataset(), but then takes the additional step of registering the connection with duckdb:

https://github.com/cboettig/gbifdb/blob/1b50eb6d83963d2c10e3cfb40ef2ff8fd2df57c3/R/gbif_conn.R#L28-L32

This ought to have an effect similar to arrow::to_duckdb() , but is done via the duckdb::duckdb_register_arrow instead. This more closely parallels the user experience of backend="duckdb". In particular, this means a user can take advantage of all methods, including window functions, that are supported in duckdb, and also makes the argument compatible with gbif_conn(backend="duckdb"), that is, you can get back a connection object. That could be useful if you wanted to database-like operations instead of single-table operations (e.g. join this table on another table in the same connection).

Might be good for us to report the crash upstream to duckdb. What version of duckdb do you get these crashes on?

Pakillo commented 2 years ago

I see... Thanks for the explanation! Then the issues seem to be with duckdb rather than arrow, right?

I'm using latest duckdb available on CRAN (0.3.2)

This is the code that crashes my session:

library(gbifdb)
library(dplyr)  

gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                   backend = "arrow")

# This bit below crashes R:
gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)

Session info:

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       Ubuntu 20.04.3 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_GB:en
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/Madrid
 date     2022-02-15
 rstudio  1.4.1717 Juliet Rose (desktop)
 pandoc   2.5 @ /usr/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 arrow           7.0.0   2022-02-10 [1] RSPM (R 4.1.0)
 assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.1.0)
 aws.s3          0.3.21  2020-04-07 [1] RSPM (R 4.1.0)
 aws.signature   0.6.0   2020-06-01 [1] RSPM (R 4.1.0)
 base64enc       0.1-3   2015-07-28 [1] RSPM (R 4.1.0)
 bit             4.0.4   2020-08-04 [1] RSPM (R 4.1.0)
 bit64           4.0.5   2020-08-30 [1] RSPM (R 4.1.0)
 cli             3.1.1   2022-01-20 [1] RSPM (R 4.1.0)
 codemeta        0.1.1   2021-12-22 [1] RSPM (R 4.1.0)
 crayon          1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
 curl            4.3.2   2021-06-23 [1] RSPM (R 4.1.0)
 DBI             1.1.2   2021-12-20 [1] RSPM (R 4.1.0)
 dbplyr          2.1.1   2021-04-06 [1] RSPM (R 4.1.0)
 digest          0.6.29  2021-12-01 [1] RSPM (R 4.1.0)
 dplyr         * 1.0.8   2022-02-08 [1] RSPM (R 4.1.0)
 duckdb          0.3.2   2022-02-07 [1] RSPM (R 4.1.0)
 ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.1.0)
 fansi           1.0.2   2022-01-14 [1] RSPM (R 4.1.0)
 gbifdb        * 0.1.1   2022-02-12 [1] Github (cboettig/gbifdb@7f425f5)
 generics        0.1.2   2022-01-31 [1] RSPM (R 4.1.0)
 glue            1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
 httr            1.4.2   2020-07-20 [1] RSPM (R 4.1.0)
 jsonlite        1.7.3   2022-01-17 [1] RSPM (R 4.1.0)
 lifecycle       1.0.1   2021-09-24 [1] RSPM (R 4.1.0)
 magrittr        2.0.2   2022-01-26 [1] RSPM (R 4.1.0)
 pillar          1.7.0   2022-02-01 [1] RSPM (R 4.1.0)
 pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.1.0)
 purrr           0.3.4   2020-04-17 [1] RSPM (R 4.1.0)
 R6              2.5.1   2021-08-19 [1] RSPM (R 4.1.0)
 rlang           1.0.1   2022-02-03 [1] RSPM (R 4.1.0)
 sessioninfo     1.2.2   2021-12-06 [1] RSPM (R 4.1.0)
 tibble          3.1.6   2021-11-07 [1] RSPM (R 4.1.0)
 tidyselect      1.1.1   2021-04-30 [1] RSPM (R 4.1.0)
 utf8            1.2.2   2021-07-24 [1] RSPM (R 4.1.0)
 vctrs           0.3.8   2021-04-29 [1] RSPM (R 4.1.0)
 xml2            1.3.3   2021-11-30 [1] RSPM (R 4.1.0)

 [1] /home/frs/R/x86_64-pc-linux-gnu-library/4.0
 [2] /opt/R/4.1.2/lib/R/library

cboettig commented 2 years ago

Thanks! Yes, at the moment, it does look like purely arrow-based methods are working more-or-less as expected, e.g. gbif_remote() with to_duckdb() = FALSE, or a local connection using arrow without using duckdb::duckdb_register_arrow.

I think both of these duckdb issues may be related to the still-open bug in the arrow package, https://issues.apache.org/jira/browse/ARROW-14745 , but honestly not sure, that's above my pay grade! I will try and follow up with the duckdb team.

I do think there's a few outstanding issues in both arrow and duckdb that I'd like to see resolved to feel better about either of these approaches.

On the test:

system.time({
  ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})

I think the implicit print command at the end should force evaluation, but I'm curious if you could just add an explicit call to collect() (or compute()) to confirm it works as expected without crashing?

Also, I'm curious if you call arrow::to_duckdb() on ex, does it work or crash?

For now it seems like it might be reasonable for gbif_local(backend="arrow") to avoid calling duckdb::duckdb_register_arrow at all. This isn't quite as tidy a solution, since it means that gbif_local() won't be just a thin wrapper around gbif_conn(), but that should be invisible to the user anyway. I can also set to_duckdb=FALSE once again as the default in gbif_remote(), thus both functions will give a "pure arrow" experience. (I'll also have to add something to deal with the less-desirable print method of open_dataset() which does not return a tbl class object, and can make duckdb dependency optional then since it's not touched by default settings.

I will try and do some more benchmarking as well.

Pakillo commented 2 years ago

Thanks @cboettig

I'm curious if you could just add an explicit call to collect()

I can confirm using arrow::open_dataset works well, also using collect:

library(gbifdb)
library(dplyr)

arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year) %>% 
  collect()
#> # A tibble: 163 × 3
#> # Groups:   class [15]
#>    class           year        n
#>    <chr>          <int>    <int>
#>  1 Aves            2015 71131647
#>  2 Reptilia        2012    57146
#>  3 Reptilia        2015    58420
#>  4 Reptilia        2013    91989
#>  5 Reptilia        2017    64181
#>  6 Mammalia        2018   450071
#>  7 Mammalia        2013   492888
#>  8 Actinopterygii  2012  1487182
#>  9 Reptilia        2011    48347
#> 10 Mammalia        2019   452761
#> # … with 153 more rows

Also, I'm curious if you call arrow::to_duckdb() on ex, does it work or crash?

It works!

library(gbifdb)
library(dplyr)

ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year)

ex.duckdb <- arrow::to_duckdb(ex)
ex.duckdb
#> # Source:   table<arrow_001> [?? x 3]
#> # Database: duckdb_connection
#> # Groups:   class
#>    class           year        n
#>    <chr>          <int>    <dbl>
#>  1 Aves            2015 71131647
#>  2 Reptilia        2012    57146
#>  3 Reptilia        2015    58420
#>  4 Reptilia        2013    91989
#>  5 Reptilia        2017    64181
#>  6 Mammalia        2018   450071
#>  7 Mammalia        2013   492888
#>  8 Actinopterygii  2012  1487182
#>  9 Reptilia        2011    48347
#> 10 Mammalia        2019   452761
#> # … with more rows

jonkeane commented 2 years ago

I think both of these duckdb issues may be related to the still-open bug in the arrow package, https://issues.apache.org/jira/browse/ARROW-14745 , but honestly not sure, that's above my pay grade! I will try and follow up with the duckdb team.

That issue is closed (though the fix for it is not in the 7.0 release!) It's possible that the crash is being caused by the behavior that's in 7.0 still. Though I will note that that change is only for going from duckdb into arrow (e.g. via to_arrow()).

I haven't looked at the code in GBIF in detail, but at first glance I did not see code that does that. Am I missing something? Or is it that after being read in with arrow and then registered with duckdb, the querying is happening all inside of duckdb?

cboettig commented 2 years ago

@jonkeane thanks so much for joining us on this! Yes, I mean to test on the nightly, as you see @Pakillo is testing against the current 7.0 release. gbifdb does not use to_arrow() anywhere though. gbifdb uses arrow::to_duckdb (optionally) on the remote data source, which seems to result in crashes (though it seems okay if to_duckdb() is used on local parquet instead of over the S3 interface).

gbifdb also uses duckdb::duckdb_register_arrow on an arrow::open_dataset() called on the local parquet files, which seems to cause trouble as well. This seems to be nearly the same thing as calling arrow::to_duckdb() (a quick skim of source code in both cases is that both are creating a "View" in duckdb of the parquet tables, right? so I think either case is handing off to duckdb for the actual queries there-after), so I'm not sure why the one (arrow::to_duckdb) works while the other fails. (Though if that's true, it's not clear to me how arrow::to_duckdb() is working on remote sources in the first place?)

jhollist commented 2 years ago

Just wanted to let you all know that I haven't been ignoring this, just didn't have anything useful to add. At this point it sounds like we are making progress on tracking down the source of the crashes, but still trying to figure out the solution. If there is anything you need from me, just let me know and I will keep an eye on the conversation.

jonkeane commented 2 years ago

gbifdb uses arrow::to_duckdb (optionally) on the remote data source, which seems to result in crashes (though it seems okay if to_duckdb() is used on local parquet instead of over the S3 interface).

Interesting. To be totally honest, I don't think I myself have tried to take an S3-backed dataset and run it through to_duckdb() so it is totally possible that there's an issue there. If you have a reproducer would you mind making a Jira for us?

gbifdb also uses duckdb::duckdb_register_arrow on an arrow::open_dataset() called on the local parquet files, which seems to cause trouble as well. This seems to be nearly the same thing as calling arrow::to_duckdb() (a quick skim of source code in both cases is that both are creating a "View" in duckdb of the parquet tables, right? so I think either case is handing off to duckdb for the actual queries there-after), so I'm not sure why the one (arrow::to_duckdb) works while the other fails. (Though if that's true, it's not clear to me how arrow::to_duckdb() is working on remote sources in the first place?)

You're right that arrow::to_duckdb() is mostly only calling duckdb::duckdb_register_arrow() which registers the View (IIRC in the DuckDB world it's actually an external table or something like that, but those are very similar to views!). We also do a bit of work to make sure there is a duckdb connection already (and create one if not) + wrap the DuckDB reference in tbl() so that it can be used in dplyr queries + setup a finalizer so that we clean the DuckDB reference when the tbl is gc()ed. Any chance you have a stack trace from where the crash is happening?

Pakillo commented 2 years ago

Hi,

I'm happy to help fix this, but I don't know what is needed exactly (beyond what I've shared above). How can I generate a stack trace of those problems? Or can you do that @cboettig?

Also, I wonder: am I the only one getting these crashes? Why? Do you have much larger RAM available?

jhollist commented 2 years ago

@Pakillo

gbif <- gbif_remote(to_duckdb = TRUE)
gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)

Does not run on my machine either. I don't ever get to a crash, it hangs. I let it run for several hours last night just to be sure it wasn't just taking long.

@cboettig Do you have a plan for solving these or do we want to put the submission on hold until some of the dependency issues clear up?

cboettig commented 2 years ago

Hi all,

I really appreciate all the help and testing! Just wanted to make sure I'm keeping straight what is and isn't crashing for everyone.

Remote sources

gbifdb has two ways to access data without download:

gbif_remote(to_duckdb = FALSE)
gbif_remote(to_duckdb = TRUE)

Am I correct that everyone sees issues with the second case but not the first on current releases, arrow 7.0 and duckdb 0.3.2?

Local

gbifdb has two ways to do local parquet access currently (1-2 below), but we're also now discussing 2 additional strategies (3-4) by which arrow could access the local dataset:

gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(backend = "duckdb") (which is "pure duckdb")
arrow::open_dataset() + to_duckdb()
arrow::open_dataset() without to_duckdb()

Do I have it correct that: 1 & 3 crash for everyone, but 2 and 4 work for everyone? (I'm double-checking these on my end too with dev versions of the packages)>

I feel like local access strategies 2 & 3 ought to be practically identical, so I'm a bit puzzled there.

Practically speaking, if my summary is correct, I can make is that for gbif_remote(), option 1 is the default, and for gbif_local(), that option 2 is the default. (this makes the remote option "pure arrow" and the local option "pure duckdb", with the minor downside that "pure arrow" doesn't return a tibble or support windowing, but can live with that until to_duckdb is fixed).

I don't want to over-impose on our rOpenSci reviewers & editors here, but really appreciate everyone's perspectives and input in weighing these alternatives.

For the cases that crash, I definitely want to make sure we can get detailed, reproducible bug reports to arrow JIRA and duckdb GitHub trackers.

stephhazlitt commented 2 years ago

@cboettig I am working through the various strategies today and will report back with details.

cboettig commented 2 years ago

fwiw I just noticed gbif_download() wasn't sync'ing correctly, so I've updated that if you want to again attempt downloading the whole thing. https://github.com/cboettig/gbifdb/tree/patch-download

stephhazlitt commented 2 years ago

I was not experiencing the same crashing as others, however as a confirmation, I thought I would try the various paths out again. I installed the patch-download version before running these.

Remote Options

1. arrow::open_dataset()

library(gbifdb)
library(dplyr)
library(arrow)

> ## arrow::open_dataset()
> bucket <- s3_bucket("gbif-open-data-us-east-1/occurrence/2022-02-01/occurrence.parquet")
> gbif_bucket <- open_dataset(bucket)
> 
> system.time({
+ df_arrow <- gbif_bucket |>
+   filter(phylum == "Chordata", year > 2010) |> 
+   count(class, year) |> 
+   collect()
+ df_arrow 
+ })
    user   system  elapsed 
 182.430   35.228 1008.520

2. gbifdb::gbif_remote(to_duckdb = FALSE)

> ## gbif::gbif_remote w/ arrow
> gbif_arrow <- gbif_remote(bucket = "gbif-open-data-us-east-1",
+                           to_duckdb = FALSE)
> 
> system.time({
+ df_gbif_arrow <- gbif_arrow |>
+   filter(phylum == "Chordata", year > 2010) |> 
+   count(class, year) |> 
+   collect() 
+ df_gbif_arrow
+ })
* Closing connection 0
    user   system  elapsed 
 184.142   36.059 1055.453

3. gbifdb::gbif_remote(to_duckdb = TRUE)

> gbif_duckdb <- gbif_remote(bucket = "gbif-open-data-us-east-1",
+                         to_duckdb = TRUE)
* Closing connection 0
> 
> system.time({
+ df_gbif_duckdb <- gbif_duckdb |>
+   filter(phylum == "Chordata", year > 2010) |> 
+   count(class, year) |> 
+   collect()
+ df_gbif_duckdb
+ })
    user   system  elapsed 
 283.422   82.892 2137.000

Session Info

> session_info()
─ Session info ─────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Big Sur 11.6
 system   aarch64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Vancouver
 date     2022-02-18
 rstudio  2021.09.2+382 Ghost Orchid (desktop)
 pandoc   NA

─ Packages ─────────────────────────────────────────
 package     * version date (UTC) lib source
 arrow       * 7.0.0   2022-02-10 [1] CRAN (R 4.1.1)
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.1)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
 blob          1.2.2   2021-07-23 [1] CRAN (R 4.1.0)
 brio          1.1.3   2021-11-30 [1] CRAN (R 4.1.1)
 cachem        1.0.6   2021-08-19 [1] CRAN (R 4.1.1)
 callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.0)
 cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
 crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
 DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
 dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
 desc          1.4.0   2021-09-28 [1] CRAN (R 4.1.1)
 devtools    * 2.4.3   2021-11-30 [1] CRAN (R 4.1.1)
 dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
 duckdb        0.3.2   2022-02-07 [1] CRAN (R 4.1.2)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 fansi         1.0.2   2022-01-14 [1] CRAN (R 4.1.1)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
 gbifdb      * 0.1.1   2022-02-18 [1] Github (cboettig/gbifdb@51c6d30)
 generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
 glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.1)
 lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
 magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.1)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.1.1)
 pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.1.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.1)
 praise        1.0.0   2015-08-11 [1] CRAN (R 4.1.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
 processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.0)
 ps            1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.1)
 rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.1)
 testthat    * 3.1.2   2022-01-20 [1] CRAN (R 4.1.1)
 tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.1)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 usethis     * 2.1.5   2021-12-09 [1] CRAN (R 4.1.1)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.1)

 [1] /Users/stephhazlitt/Rlibrary
 [2] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library

Pakillo commented 2 years ago

Hi, here are my results

Remote options

1. arrow::open_dataset() - WORKS

library(gbifdb)
library(dplyr)
library(arrow)

## arrow::open_dataset()
bucket <- s3_bucket("gbif-open-data-eu-central-1/occurrence/2022-02-01/occurrence.parquet")
gbif_bucket <- open_dataset(bucket)

system.time({
  df_arrow <- gbif_bucket |>
    filter(phylum == "Chordata", year > 2010) |>
    count(class, year) |>
    collect()
  df_arrow
})
#>    user  system elapsed 
#> 235.360  27.849 203.743

2. gbifdb::gbif_remote(to_duckdb = FALSE) - WORKS

library(gbifdb)
library(dplyr)

gbif_arrow <- gbif_remote(bucket = "gbif-open-data-eu-central-1",
                          to_duckdb = FALSE)

system.time({
  df_gbif_arrow <- gbif_arrow |>
    filter(phylum == "Chordata", year > 2010) |> 
    count(class, year) |> 
    collect() 
  df_gbif_arrow
})
#>    user  system elapsed 
#> 235.260  26.648 181.525

3. gbifdb::gbif_remote(to_duckdb = TRUE) - CRASH

CRASHES R (due to RAM)

Pakillo commented 2 years ago

Local database

1. gbif_local(backend = "arrow") - CRASH

(which is arrow::open_dataset() + duckdb::duckdb_register_arrow)

gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                     backend = "arrow")
gbif %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year) %>% 
  collect()

CRASH

2. gbif_local(backend = "duckdb") - WORKS

(which is "pure duckdb")

system.time({
  gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                     backend = "duckdb")
  gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year) %>% 
    collect()
})
#>    user  system elapsed 
#>  78.476   2.166  94.556

3. arrow::open_dataset() + to_duckdb() - CRASH

system.time({
  arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>% 
  arrow::to_duckdb() %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year) %>% 
    collect()
})

CRASHES R

4. arrow::open_dataset() without to_duckdb() - WORKS

system.time({
  arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year) %>% 
    collect()
})
#>    user  system elapsed 
#> 273.518  15.553  40.928

Conclusions

gbif_remote works well when to_duckdb = FALSE.
gbif_local works well when backend = "duckdb"

So, the problem seems to be when converting arrow to duckdb

(Note that in a previous example (https://github.com/ropensci/software-review/issues/492#issuecomment-1041361595) converting arrow to duckdb did work well, but I was passing the output of a query, not the entire database)

I feel like local access strategies 2 (gbif_local(backend = "duckdb")) & 3 (arrow::open_dataset() + to_duckdb()) ought to be practically identical, so I'm a bit puzzled there.

From a quick look at the code, it looks like gbif_local with backend = "duckdb" uses duckdb to read database directly, no? So it does avoid the arrow::to_duckdb() conversion (or duckdb::duckdb_register_arrow) which seem to be causing the problems...

stephhazlitt commented 2 years ago

Local Options

library(gbifdb)
library(dplyr)
library(arrow)

gbif_dir <- "/Users/stephhazlitt/dev/gbif-data/occurrence.parquet/"

1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)

WORKS

# 1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(dir = gbif_dir,
           backend = "arrow") |>
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()
#> # A tibble: 143 × 3
#> # Groups:   class [15]
#>    class           year     n
#>    <chr>          <int> <dbl>
#>  1 Reptilia        2018 10837
#>  2 Actinopterygii  2019 21428
#>  3 Reptilia        2013  6236
#>  4 Actinopterygii  2014 70435
#>  5 Actinopterygii  2013 91919
#>  6 <NA>            2013    60
#>  7 Reptilia        2014  6128
#>  8 Amphibia        2017  9118
#>  9 Actinopterygii  2016 58846
#> 10 Reptilia        2017  9534
#> # … with 133 more rows

2. gbif_local(backend = "duckdb") (which is "pure duckdb")

WORKS

# 2. gbif_local(backend = "duckdb") (which is "pure duckdb")
gbif_local(dir = gbif_dir,
           backend = "duckdb") |>
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()
#> # A tibble: 143 × 3
#> # Groups:   class [15]
#>    class       year     n
#>    <chr>      <int> <dbl>
#>  1 Ascidiacea  2018   187
#>  2 Ascidiacea  2011   302
#>  3 Ascidiacea  2012   276
#>  4 Ascidiacea  2014   757
#>  5 Ascidiacea  2016   335
#>  6 Ascidiacea  2015   392
#>  7 Ascidiacea  2013   169
#>  8 Ascidiacea  2017   394
#>  9 Ascidiacea  2019   161
#> 10 Ascidiacea  2020   144
#> # … with 133 more rows

3. arrow::open_dataset() + to_duckdb()

CRASHES

# 3. arrow::open_dataset() + to_duckdb()
arrow::open_dataset(gbif_dir) |>
  to_duckdb() |> 
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()

4. arrow::open_dataset() without to_duckdb()

WORKS

# 4. arrow::open_dataset() without to_duckdb()
arrow::open_dataset(gbif_dir) |>
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()
#> # A tibble: 143 × 3
#> # Groups:   class [15]
#>    class           year       n
#>    <chr>          <int>   <int>
#>  1 Aves            2015 1880023
#>  2 Reptilia        2012    5686
#>  3 Reptilia        2015    6219
#>  4 Reptilia        2013    6236
#>  5 Reptilia        2017    9534
#>  6 Mammalia        2018   45356
#>  7 Mammalia        2013   57009
#>  8 Actinopterygii  2012  105466
#>  9 Reptilia        2011    5481
#> 10 Mammalia        2019   58039
#> # … with 133 more rows

stephhazlitt commented 2 years ago

I am not sure why now, as it often just crashed out right, but when I was running the 3. arrow::open_dataset() + to_duckdb() my computer prompted me to force close applications as it was running out of memory -- so I guess it is a RAM issue. No issue with any of the options with the very small example data set:

library(gbifdb)
library(dplyr)
library(arrow)

# 1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(dir = gbif_example_data(),
           backend = "arrow") |>
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#>   class              n
#>   <chr>          <dbl>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

# 2. gbif_local(backend = "duckdb") (which is "pure duckdb")
gbif_local(dir = gbif_example_data(),
           backend = "duckdb") |>
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#>   class              n
#>   <chr>          <dbl>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

# 3. arrow::open_dataset() + to_duckdb()
arrow::open_dataset(gbif_example_data()) |>
  to_duckdb() |> 
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#>   class              n
#>   <chr>          <dbl>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

# 4. arrow::open_dataset() without to_duckdb()
arrow::open_dataset(gbif_example_data()) |>
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#> # Groups:   class [6]
#>   class              n
#>   <chr>          <int>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

jhollist commented 2 years ago

@cboettig Just checking in to see where we are on this. It sounds like we are in a holding pattern until the dependency issues are resolved. It might make sense to put the review on hold until then, unless you think it won't be too much longer. Thoughts?

cboettig commented 2 years ago

@jhollist thanks for the ping!

Good question, yeah, can't really predict how long it might be to see the to_duckdb() or duckdb_register_arrow() mechanisms fixed.

It sounds like we could just make gbif_local() be "pure-duckdb" and gbif_remote() be pure arrow and we would not have any issues. Once those mechanisms stabilize, we could put them back in (I think they have the potential for nicer consistency and performance once they are finally resolved, but kinda a minor thing). Does that sound reasonable to everyone?

jhollist commented 2 years ago

That would work for me. @stephhazlitt and @Pakillo does @cboettig suggestion work for you.

stephhazlitt commented 2 years ago

Very reasonable approach to me.

Pakillo commented 2 years ago

Sounds good to me. Whatever you prefer!

cboettig commented 2 years ago

Thanks! I've switched to the duckdb-only local and arrow-only remote as the defaults, https://github.com/cboettig/gbifdb/pull/6

jhollist commented 2 years ago

@cboettig Thanks for those changes. I can confirm that gbif_remote() works for me locally. I didn't test gbif_local() as I don't have 100GB free! That being said, looks like no one had issues with the pure duckdb versions.

Let me refresh my memory on next steps here, but I think we are very close!

@Pakillo and @stephhazlitt any remaining concerns?

jhollist commented 2 years ago

@cboettig we are good to go!

@Pakillo and @stephhazlitt I went ahead and checked off the two remaining items on your checklists. The subsequent discussion indicates you are both good to go with accepting gbifdb.

Congrats, Carl! Another great contribution.

jhollist commented 2 years ago

@ropensci-review-bot approve gbifdb

ropensci-review-bot commented 2 years ago

Approved! Thanks @cboettig for submitting and @Pakillo, @stephhazlitt for your reviews! :grin:

To-dos:

[ ] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so.
[ ] After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
[ ] Fix all links to the GitHub repo to point to the repo under the ropensci organization.
[ ] Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
[ ] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current pkgdown website with a redirecting page
- replace your package docs URL with https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
[ ] Fix any links in badges for CI and coverage to point to the new repository URL.
[ ] Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
[ ] We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.
[ ] You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. She will get in touch about timing and can answer any questions.

We maintain an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also feature CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.

ropensci / software-review

gbifdb: Local Database Interface to 'GBIF' #492

Scope

Technical checks

Publication options

Code of conduct

Remote connection

Local database

Remote sources

Local

Remote Options

1. arrow::open_dataset()

2. gbifdb::gbif_remote(to_duckdb = FALSE)

3. gbifdb::gbif_remote(to_duckdb = TRUE)

Session Info

Remote options

1. arrow::open_dataset() - WORKS

2. gbifdb::gbif_remote(to_duckdb = FALSE) - WORKS

3. gbifdb::gbif_remote(to_duckdb = TRUE) - CRASH

Local database

1. gbif_local(backend = "arrow") - CRASH

2. gbif_local(backend = "duckdb") - WORKS

3. arrow::open_dataset() + to_duckdb() - CRASH

4. arrow::open_dataset() without to_duckdb() - WORKS

Conclusions

Local Options