ropensci / software-review

rOpenSci Software Peer Review.

gbifdb: Local Database Interface to 'GBIF' #492

Closed cboettig closed 2 years ago

cboettig commented 2 years ago

Date accepted: 2022-03-22
Submitting Author Name: Carl Boettiger
Submitting Author Github Handle: @cboettig
Repository: https://github.com/cboettig/gbifdb
Version submitted: 0.1.0
Submission type: Standard
Editor: @jhollist
Reviewers: @Pakillo, @stephhazlitt

Due date for @Pakillo: 2022-01-10 Due date for @stephhazlitt: 2022-01-10

Archive: TBD Version accepted: TBD Language: en


Package: gbifdb
Version: 0.1.0
Title: Local Database Interface to 'GBIF'
Description: A high performance interface to the Global Biodiversity
  Information Facility, 'GBIF'.  In contrast to 'rgbif', which can
  access small subsets of 'GBIF' data through web-based queries to
  a central server, 'gbifdb' provides enhanced performance for R users
  performing large-scale analyses on servers and cloud computing
  providers, providing full support for arbitrary 'SQL' or 'dplyr'
  operations on the complete 'GBIF' data tables (now over 1 billion
  records, and over a terabyte in size). 'gbifdb' accesses a copy
  of the 'GBIF' data in 'parquet' format, which is already readily
  available in commercial computing clouds such as the Amazon Open
  Data portal and the Microsoft Planetary Computer, or can be 
  accessed directly or downloaded
  to any server with suitable bandwidth and storage space.
Authors@R: c(
            person("Carl", "Boettiger", , "cboettig@gmail.com", c("aut", "cre"),
                   comment = c(ORCID = "0000-0002-1642-628X"))
            )
License: MIT + file LICENSE
Encoding: UTF-8
ByteCompile: true
Depends: R (>= 4.0)
Imports:
    duckdb (>= 0.2.9),
    DBI
Suggests:
    spelling,
    dplyr,
    dbplyr,
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown,
    aws.s3,
    arrow
URL: https://github.com/cboettig/gbifdb
BugReports: https://github.com/cboettig/gbifdb
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

This package is similar in scope to rgbif, but allows direct access to the full occurrence table using dplyr functions with or without downloading.

The target audience is ecologists and other researchers analyzing GBIF data. In particular, anyone attempting analyses that require the whole GBIF data corpus, like the example shown in the README, cannot easily do so with the standard API / rgbif approach.

As noted above, this is similar in scope to rgbif, but more general and faster.
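The workflow this enables can be sketched briefly (a minimal sketch based on the README example; the arguments shown are defaults and purely illustrative):

``` r
library(gbifdb)
library(dplyr)

# Connect to a public GBIF parquet snapshot (no download required)
gbif <- gbif_remote()

# dplyr verbs are evaluated lazily against the full occurrence table
gbif %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year) %>%
  collect()
```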

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

jhollist commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1010488630 time 5

ropensci-review-bot commented 2 years ago

Logged review for stephhazlitt (hours: 5)

jhollist commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1012556350 time 12

ropensci-review-bot commented 2 years ago

Couldn't find entry for pakillo in the reviews log

maelle commented 2 years ago

@jhollist this might be a small buffy bug, thanks for your patience :-) https://github.com/ropensci-org/buffy/issues/54

maelle commented 2 years ago

@jhollist could you please try again? thank you!

jhollist commented 2 years ago

@ropensci-review-bot submit review #492 (comment) time 12

ropensci-review-bot commented 2 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@ropensci-review-bot help

jhollist commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1012556350 time 12

ropensci-review-bot commented 2 years ago

Logged review for Pakillo (hours: 12)

jhollist commented 2 years ago

Sorry for the delay in moving things along, @maelle! Looks like I had a bug too! Works now.

jhollist commented 2 years ago

@cboettig Now that I finally got caught up, I have added the "Review In, Awaiting Changes" label. Sorry for slowing things down.

maelle commented 2 years ago

@jhollist Thanks a ton, actually it's a happy mistake as it's the first time this new "sorry human" response happens in this repo :grin:

cboettig commented 2 years ago

Thanks all! I'm still working on revisions as noted above; I'll include a complete report and links to my PR diffs when I've ticked off everything.

cboettig commented 2 years ago

Ok, I think I've hit everything now, between https://github.com/cboettig/gbifdb/pull/3 and https://github.com/cboettig/gbifdb/pull/4. Summaries are on each PR, and maybe the diffs are helpful too.

jhollist commented 2 years ago

@Pakillo and @stephhazlitt just wanted to make sure you saw these. @cboettig has two PRs linked in https://github.com/ropensci/software-review/issues/492#issuecomment-1027489887 that should address the issues raised in your reviews. Could you please take a look at these and let us know if everything is good to go?

jhollist commented 2 years ago

@Pakillo and @stephhazlitt pinging you both again to make sure that your review comments have been addressed.

stephhazlitt commented 2 years ago

Thanks for the (double) ping @jhollist. I have pulled the peer-review/patch2 branch as I think it has all the review-related changes. I really like the new gbif_local() convenience function @cboettig. I am not sure that the top-level package documentation made it in, so I will put a plug in for that so that ?gbifdb works for users. Great work on the package! πŸ¦‹ 🐝 πŸ’

Pakillo commented 2 years ago

Hi,

Thanks @cboettig for working out all those changes. The documentation and usage look much nicer.

Unfortunately I'm still having issues trying to make the package work...

Remote connection

First, if I try the remote connection (gbif_remote) using duckdb (the default), my R session invariably crashes whenever I run a query, as it runs out of RAM (16 GB on my laptop).

I can connect successfully to the remote database: gbif <- gbif_remote(bucket = "gbif-open-data-eu-central-1", to_duckdb = TRUE)

but when I run a query, like the one in the README or others, my R session crashes:

gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)

The remote connection does work well if I choose an arrow connection, although querying the database still seems a bit slow:

library(gbifdb)
library(dplyr)  # optional, for dplyr-based operations

gbif <- gbif_remote(bucket = "gbif-open-data-eu-central-1",
                    to_duckdb = FALSE)

system.time({
  ex <- gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
})
#>    user  system elapsed 
#>   0.055   0.000   0.056

system.time({  
  ex %>% 
    collect()
})
#>    user  system elapsed 
#> 310.606  28.141 126.655

system.time({
df <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2020) %>% 
  # count(year) %>%  
  select(species, year, locality, decimallatitude, decimallongitude)
})
#>    user  system elapsed 
#>   0.033   0.003   0.036

system.time({
df %>% 
  collect() 
})
#>    user  system elapsed 
#> 328.496 115.479 426.916

system.time({
df <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2000) %>% 
  count(year) %>%
  collect()
})
#>    user  system elapsed 
#> 271.148  43.273 220.665

Created on 2022-02-12 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Ubuntu 20.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_GB:en
#>  collate  en_GB.UTF-8
#>  ctype    en_GB.UTF-8
#>  tz       Europe/Madrid
#>  date     2022-02-12
#>  pandoc   2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  arrow         7.0.0   2022-02-10 [1] RSPM (R 4.1.0)
#>  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.1.0)
#>  bit           4.0.4   2020-08-04 [1] RSPM (R 4.1.0)
#>  bit64         4.0.5   2020-08-30 [1] RSPM (R 4.1.0)
#>  cli           3.1.1   2022-01-20 [1] RSPM (R 4.1.0)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  DBI           1.1.2   2021-12-20 [1] RSPM (R 4.1.0)
#>  digest        0.6.29  2021-12-01 [1] RSPM (R 4.1.0)
#>  dplyr       * 1.0.8   2022-02-08 [1] RSPM (R 4.1.0)
#>  ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] RSPM (R 4.1.0)
#>  fansi         1.0.2   2022-01-14 [1] RSPM (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] RSPM (R 4.1.0)
#>  fs            1.5.2   2021-12-08 [1] RSPM (R 4.1.0)
#>  gbifdb      * 0.1.1   2022-02-12 [1] Github (cboettig/gbifdb@7f425f5)
#>  generics      0.1.2   2022-01-31 [1] RSPM (R 4.1.0)
#>  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] RSPM (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] RSPM (R 4.1.0)
#>  knitr         1.37    2021-12-16 [1] RSPM (R 4.1.0)
#>  lifecycle     1.0.1   2021-09-24 [1] RSPM (R 4.1.0)
#>  magrittr      2.0.2   2022-01-26 [1] RSPM (R 4.1.0)
#>  pillar        1.7.0   2022-02-01 [1] RSPM (R 4.1.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] RSPM (R 4.1.0)
#>  reprex        2.0.1   2021-08-05 [1] RSPM (R 4.1.0)
#>  rlang         1.0.1   2022-02-03 [1] RSPM (R 4.1.0)
#>  rmarkdown     2.11    2021-09-14 [1] RSPM (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.1.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.1.0)
#>  stringi       1.7.6   2021-11-29 [1] RSPM (R 4.1.0)
#>  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.1.0)
#>  tibble        3.1.6   2021-11-07 [1] RSPM (R 4.1.0)
#>  tidyselect    1.1.1   2021-04-30 [1] RSPM (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] RSPM (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [1] RSPM (R 4.1.0)
#>  withr         2.4.3   2021-11-30 [1] RSPM (R 4.1.0)
#>  xfun          0.29    2021-12-14 [1] RSPM (R 4.1.0)
#>  yaml          2.2.2   2022-01-25 [1] CRAN (R 4.1.2)
#>
#>  [1] /home/frs/R/x86_64-pc-linux-gnu-library/4.0
#>  [2] /opt/R/4.1.2/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Local database

Querying the local database does work well, and quite fast, but only when using the duckdb backend:

library(gbifdb)
library(dplyr)  

gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                   backend = "duckdb")

system.time({
  ex <- gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})
#>    user  system elapsed 
#>   0.015   0.000   0.015

If I choose the arrow backend, my R session always crashes (again due to running out of RAM).

Sorry for the bad news. Please let me know if I'm doing something wrong. Otherwise I hope it's not too complicated to fix.

Cheers

P.S. The pkgdown website and the README on GitHub are out of date, as README.Rmd has not been knitted.

cboettig commented 2 years ago

Thanks @Pakillo! Yup, looks like there are still upstream issues in arrow and duckdb being worked out -- e.g. the arrow 7.0 R client has a bug in count() that causes it to read into memory. I have a handful of open tickets in arrow, and they have a great and very responsive team of developers, so I'm hopeful this will improve soon. See:

Bizarrely, I don't see a duckdb version in your sessionInfo. I have an outstanding issue on duckdb regarding RAM performance of to_duckdb() as well (https://github.com/duckdb/duckdb/issues/2768), so I think we may need to wait for that one to resolve upstream too.

It sounds like we have reasonable functionality as long as a user sticks with arrow for remote connections and duckdb for local connections, then? (Ironically, I've chosen the opposite setup as the default both times!) I'm still optimistic things will improve as the upstream issues in arrow and duckdb are closed.

stephhazlitt commented 2 years ago

@Pakillo I am curious what happens when you access the data directly with arrow::open_dataset()? Probably the same crash if it is RAM-related; however, I am finding that this approach is generally faster on the example dataset and the subset of the GBIF data I downloaded.

library(gbifdb)
library(dplyr)

system.time({
  ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})
Pakillo commented 2 years ago

Thanks @cboettig. I reckon these issues stem from upstream dependencies (arrow and duckdb) - hope they are fixed soon.

bizarrely, I don't see duckdb version in your sessionInfo

That's because in that session I chose the arrow backend directly (to_duckdb = FALSE), so it seems duckdb does not get loaded.

@stephhazlitt Thanks for the suggestion! Using arrow::open_dataset, as you suggest, does not crash my R session (while using gbif_local with backend = "arrow" does crash it). Why could that be @cboettig?

library(gbifdb)
library(dplyr)

system.time({
  ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})
#>    user  system elapsed 
#>   0.199   0.017   0.252
cboettig commented 2 years ago

Why could that be @cboettig?

gbif_local() using the arrow backend calls arrow::open_dataset(), but then takes the additional step of registering the connection with duckdb:

https://github.com/cboettig/gbifdb/blob/1b50eb6d83963d2c10e3cfb40ef2ff8fd2df57c3/R/gbif_conn.R#L28-L32

This ought to have an effect similar to arrow::to_duckdb(), but is done via duckdb::duckdb_register_arrow instead. This more closely parallels the user experience of backend = "duckdb". In particular, it means a user can take advantage of all methods supported in duckdb, including window functions, and it also makes the argument compatible with gbif_conn(backend = "duckdb"); that is, you can get back a connection object. That could be useful if you wanted to do database-like operations instead of single-table operations (e.g. join this table on another table in the same connection).
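For reference, that arrow-backend wiring can be sketched roughly as follows (a simplified sketch, not the actual package source; the view name "gbif" and the variable names are illustrative):

``` r
# Sketch of the arrow backend, assuming `dir` is a local parquet directory:
ds  <- arrow::open_dataset(dir)                  # scan the parquet files with arrow
con <- DBI::dbConnect(duckdb::duckdb())          # in-memory duckdb connection
duckdb::duckdb_register_arrow(con, "gbif", ds)   # expose the arrow dataset as a duckdb view
gbif <- dplyr::tbl(con, "gbif")                  # subsequent dplyr queries execute in duckdb
```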

Might be good for us to report the crash upstream to duckdb. What version of duckdb do you get these crashes on?

Pakillo commented 2 years ago

I see... Thanks for the explanation! Then the issues seem to be with duckdb rather than arrow, right?

I'm using the latest duckdb available on CRAN (0.3.2).

This is the code that crashes my session:

library(gbifdb)
library(dplyr)  

gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                   backend = "arrow")

# This bit below crashes R:
gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)

Session info:

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       Ubuntu 20.04.3 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_GB:en
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/Madrid
 date     2022-02-15
 rstudio  1.4.1717 Juliet Rose (desktop)
 pandoc   2.5 @ /usr/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 arrow           7.0.0   2022-02-10 [1] RSPM (R 4.1.0)
 assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.1.0)
 aws.s3          0.3.21  2020-04-07 [1] RSPM (R 4.1.0)
 aws.signature   0.6.0   2020-06-01 [1] RSPM (R 4.1.0)
 base64enc       0.1-3   2015-07-28 [1] RSPM (R 4.1.0)
 bit             4.0.4   2020-08-04 [1] RSPM (R 4.1.0)
 bit64           4.0.5   2020-08-30 [1] RSPM (R 4.1.0)
 cli             3.1.1   2022-01-20 [1] RSPM (R 4.1.0)
 codemeta        0.1.1   2021-12-22 [1] RSPM (R 4.1.0)
 crayon          1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
 curl            4.3.2   2021-06-23 [1] RSPM (R 4.1.0)
 DBI             1.1.2   2021-12-20 [1] RSPM (R 4.1.0)
 dbplyr          2.1.1   2021-04-06 [1] RSPM (R 4.1.0)
 digest          0.6.29  2021-12-01 [1] RSPM (R 4.1.0)
 dplyr         * 1.0.8   2022-02-08 [1] RSPM (R 4.1.0)
 duckdb          0.3.2   2022-02-07 [1] RSPM (R 4.1.0)
 ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.1.0)
 fansi           1.0.2   2022-01-14 [1] RSPM (R 4.1.0)
 gbifdb        * 0.1.1   2022-02-12 [1] Github (cboettig/gbifdb@7f425f5)
 generics        0.1.2   2022-01-31 [1] RSPM (R 4.1.0)
 glue            1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
 httr            1.4.2   2020-07-20 [1] RSPM (R 4.1.0)
 jsonlite        1.7.3   2022-01-17 [1] RSPM (R 4.1.0)
 lifecycle       1.0.1   2021-09-24 [1] RSPM (R 4.1.0)
 magrittr        2.0.2   2022-01-26 [1] RSPM (R 4.1.0)
 pillar          1.7.0   2022-02-01 [1] RSPM (R 4.1.0)
 pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.1.0)
 purrr           0.3.4   2020-04-17 [1] RSPM (R 4.1.0)
 R6              2.5.1   2021-08-19 [1] RSPM (R 4.1.0)
 rlang           1.0.1   2022-02-03 [1] RSPM (R 4.1.0)
 sessioninfo     1.2.2   2021-12-06 [1] RSPM (R 4.1.0)
 tibble          3.1.6   2021-11-07 [1] RSPM (R 4.1.0)
 tidyselect      1.1.1   2021-04-30 [1] RSPM (R 4.1.0)
 utf8            1.2.2   2021-07-24 [1] RSPM (R 4.1.0)
 vctrs           0.3.8   2021-04-29 [1] RSPM (R 4.1.0)
 xml2            1.3.3   2021-11-30 [1] RSPM (R 4.1.0)

 [1] /home/frs/R/x86_64-pc-linux-gnu-library/4.0
 [2] /opt/R/4.1.2/lib/R/library
cboettig commented 2 years ago

Thanks! Yes, at the moment it does look like purely arrow-based methods are working more or less as expected, e.g. gbif_remote() with to_duckdb = FALSE, or a local connection using arrow without duckdb::duckdb_register_arrow.

I think both of these duckdb issues may be related to the still-open bug in the arrow package, https://issues.apache.org/jira/browse/ARROW-14745 , but honestly not sure, that's above my pay grade! I will try and follow up with the duckdb team.

I do think there are a few outstanding issues in both arrow and duckdb that I'd like to see resolved to feel better about either of these approaches.

On the test:

system.time({
  ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)
  ex
})

I think the implicit print at the end should force evaluation, but I'm curious whether you could add an explicit call to collect() (or compute()) to confirm it works as expected without crashing?

Also, I'm curious: if you call arrow::to_duckdb() on ex, does it work or crash?

For now it seems reasonable for gbif_local(backend = "arrow") to avoid calling duckdb::duckdb_register_arrow at all. This isn't quite as tidy a solution, since it means that gbif_local() won't be just a thin wrapper around gbif_conn(), but that should be invisible to the user anyway. I can also set to_duckdb = FALSE once again as the default in gbif_remote(), so both functions will give a "pure arrow" experience. (I'll also have to add something to deal with the less-desirable print method of open_dataset(), which does not return a tbl class object, and I can then make the duckdb dependency optional since it's not touched by default settings.)

I will try and do some more benchmarking as well.

Pakillo commented 2 years ago

Thanks @cboettig

I'm curious if you could just add an explicit call to collect()

I can confirm that using arrow::open_dataset works well, including with collect:

library(gbifdb)
library(dplyr)

arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year) %>% 
  collect()
#> # A tibble: 163 Γ— 3
#> # Groups:   class [15]
#>    class           year        n
#>    <chr>          <int>    <int>
#>  1 Aves            2015 71131647
#>  2 Reptilia        2012    57146
#>  3 Reptilia        2015    58420
#>  4 Reptilia        2013    91989
#>  5 Reptilia        2017    64181
#>  6 Mammalia        2018   450071
#>  7 Mammalia        2013   492888
#>  8 Actinopterygii  2012  1487182
#>  9 Reptilia        2011    48347
#> 10 Mammalia        2019   452761
#> # … with 153 more rows

Also, I'm curious: if you call arrow::to_duckdb() on ex, does it work or crash?

It works!

library(gbifdb)
library(dplyr)

ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year)

ex.duckdb <- arrow::to_duckdb(ex)
ex.duckdb
#> # Source:   table<arrow_001> [?? x 3]
#> # Database: duckdb_connection
#> # Groups:   class
#>    class           year        n
#>    <chr>          <int>    <dbl>
#>  1 Aves            2015 71131647
#>  2 Reptilia        2012    57146
#>  3 Reptilia        2015    58420
#>  4 Reptilia        2013    91989
#>  5 Reptilia        2017    64181
#>  6 Mammalia        2018   450071
#>  7 Mammalia        2013   492888
#>  8 Actinopterygii  2012  1487182
#>  9 Reptilia        2011    48347
#> 10 Mammalia        2019   452761
#> # … with more rows
jonkeane commented 2 years ago

I think both of these duckdb issues may be related to the still-open bug in the arrow package, https://issues.apache.org/jira/browse/ARROW-14745 , but honestly not sure, that's above my pay grade! I will try and follow up with the duckdb team.

That issue is closed (though the fix for it is not in the 7.0 release!). It's possible that the crash is being caused by the behavior that's still in 7.0. Though I will note that that change only affects going from duckdb into arrow (e.g. via to_arrow()).

I haven't looked at the code in gbifdb in detail, but at first glance I did not see code that does that. Am I missing something? Or is it that after being read in with arrow and then registered with duckdb, the querying is happening entirely inside duckdb?

cboettig commented 2 years ago

@jonkeane thanks so much for joining us on this! Yes, I mean to test on the nightly; as you see, @Pakillo is testing against the current 7.0 release. gbifdb does not use to_arrow() anywhere, though. gbifdb uses arrow::to_duckdb (optionally) on the remote data source, which seems to result in crashes (though it seems okay if to_duckdb() is used on local parquet instead of over the S3 interface).

gbifdb also uses duckdb::duckdb_register_arrow on an arrow::open_dataset() called on the local parquet files, which seems to cause trouble as well. This seems to be nearly the same thing as calling arrow::to_duckdb() (a quick skim of the source suggests that both create a "View" in duckdb of the parquet tables, right? so I think either case hands off to duckdb for the actual queries thereafter), so I'm not sure why the one (arrow::to_duckdb) works while the other fails. (Though if that's true, it's not clear to me how arrow::to_duckdb() is working on remote sources in the first place?)

jhollist commented 2 years ago

Just wanted to let you all know that I haven't been ignoring this, just didn't have anything useful to add. At this point it sounds like we are making progress on tracking down the source of the crashes, but still trying to figure out the solution. If there is anything you need from me, just let me know and I will keep an eye on the conversation.

jonkeane commented 2 years ago

gbifdb uses arrow::to_duckdb (optionally) on the remote data source, which seems to result in crashes (though it seems okay if to_duckdb() is used on local parquet instead of over the S3 interface).

Interesting. To be totally honest, I don't think I myself have tried to take an S3-backed dataset and run it through to_duckdb() so it is totally possible that there's an issue there. If you have a reproducer would you mind making a Jira for us?

gbifdb also uses duckdb::duckdb_register_arrow on an arrow::open_dataset() called on the local parquet files, which seems to cause trouble as well. This seems to be nearly the same thing as calling arrow::to_duckdb() (a quick skim of source code in both cases is that both are creating a "View" in duckdb of the parquet tables, right? so I think either case is handing off to duckdb for the actual queries there-after), so I'm not sure why the one (arrow::to_duckdb) works while the other fails. (Though if that's true, it's not clear to me how arrow::to_duckdb() is working on remote sources in the first place?)

You're right that arrow::to_duckdb() is mostly only calling duckdb::duckdb_register_arrow(), which registers the View (IIRC in the DuckDB world it's actually an external table or something like that, but those are very similar to views!). We also do a bit of work to make sure there is a duckdb connection already (and create one if not), wrap the DuckDB reference in tbl() so that it can be used in dplyr queries, and set up a finalizer so that we clean up the DuckDB reference when the tbl is gc()ed. Any chance you have a stack trace from where the crash is happening?

Pakillo commented 2 years ago

Hi,

I'm happy to help fix this, but I don't know exactly what is needed (beyond what I've shared above). How can I generate a stack trace of those crashes? Or can you do that @cboettig?

Also, I wonder: am I the only one getting these crashes? Why? Do you have much larger RAM available?

jhollist commented 2 years ago

@Pakillo

gbif <- gbif_remote(to_duckdb = TRUE)
gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year)

Does not run on my machine either. I never get to a crash; it just hangs. I let it run for several hours last night just to be sure it wasn't simply taking a long time.

@cboettig Do you have a plan for solving these or do we want to put the submission on hold until some of the dependency issues clear up?

cboettig commented 2 years ago

Hi all,

I really appreciate all the help and testing! Just wanted to make sure I'm keeping straight what is and isn't crashing for everyone.

Remote sources

gbifdb has two ways to access data without download:

  1. gbif_remote(to_duckdb = FALSE)
  2. gbif_remote(to_duckdb = TRUE)

Am I correct that everyone sees issues with the second case but not the first on current releases, arrow 7.0 and duckdb 0.3.2?

Local

gbifdb has two ways to do local parquet access currently (1-2 below), but we're also now discussing 2 additional strategies (3-4) by which arrow could access the local dataset:

  1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
  2. gbif_local(backend = "duckdb") (which is "pure duckdb")
  3. arrow::open_dataset() + to_duckdb()
  4. arrow::open_dataset() without to_duckdb()

Do I have it correct that 1 & 3 crash for everyone, but 2 and 4 work for everyone? (I'm double-checking these on my end too with dev versions of the packages.)

I feel like local access strategies 1 & 3 ought to be practically identical, so I'm a bit puzzled there.

Practically speaking, if my summary is correct, I can make option 1 the default for gbif_remote() and option 2 the default for gbif_local(). (This makes the remote option "pure arrow" and the local option "pure duckdb", with the minor downside that "pure arrow" doesn't return a tibble or support windowing, but we can live with that until to_duckdb is fixed.)
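Concretely, those proposed defaults would pair each access mode with a single engine (a sketch assuming the current function signatures; argument values shown are the proposed defaults, not necessarily the released ones):

``` r
# "Pure arrow" remote access (option 1 as the default):
gbif <- gbif_remote(to_duckdb = FALSE)

# "Pure duckdb" local access (option 2 as the default):
gbif <- gbif_local(backend = "duckdb")
```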

I don't want to over-impose on our rOpenSci reviewers & editors here, but really appreciate everyone's perspectives and input in weighing these alternatives.

For the cases that crash, I definitely want to make sure we can get detailed, reproducible bug reports to arrow JIRA and duckdb GitHub trackers.

stephhazlitt commented 2 years ago

@cboettig I am working through the various strategies today and will report back with details.

cboettig commented 2 years ago

fwiw, I just noticed gbif_download() wasn't syncing correctly, so I've updated that if you want to attempt downloading the whole thing again. https://github.com/cboettig/gbifdb/tree/patch-download

stephhazlitt commented 2 years ago

I was not experiencing the same crashes as others; however, as a confirmation, I thought I would try the various paths out again. I installed the patch-download version before running these.

Remote Options

1. arrow::open_dataset()

library(gbifdb)
library(dplyr)
library(arrow)

> ## arrow::open_dataset()
> bucket <- s3_bucket("gbif-open-data-us-east-1/occurrence/2022-02-01/occurrence.parquet")
> gbif_bucket <- open_dataset(bucket)
> 
> system.time({
+ df_arrow <- gbif_bucket |>
+   filter(phylum == "Chordata", year > 2010) |> 
+   count(class, year) |> 
+   collect()
+ df_arrow 
+ })
    user   system  elapsed 
 182.430   35.228 1008.520 

2. gbifdb::gbif_remote(to_duckdb = FALSE)

> ## gbif::gbif_remote w/ arrow
> gbif_arrow <- gbif_remote(bucket = "gbif-open-data-us-east-1",
+                           to_duckdb = FALSE)
> 
> system.time({
+ df_gbif_arrow <- gbif_arrow |>
+   filter(phylum == "Chordata", year > 2010) |> 
+   count(class, year) |> 
+   collect() 
+ df_gbif_arrow
+ })
* Closing connection 0
    user   system  elapsed 
 184.142   36.059 1055.453 

3. gbifdb::gbif_remote(to_duckdb = TRUE)

> gbif_duckdb <- gbif_remote(bucket = "gbif-open-data-us-east-1",
+                         to_duckdb = TRUE)
* Closing connection 0
> 
> system.time({
+ df_gbif_duckdb <- gbif_duckdb |>
+   filter(phylum == "Chordata", year > 2010) |> 
+   count(class, year) |> 
+   collect()
+ df_gbif_duckdb
+ })
    user   system  elapsed 
 283.422   82.892 2137.000 

Session Info

> session_info()
─ Session info ─────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Big Sur 11.6
 system   aarch64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Vancouver
 date     2022-02-18
 rstudio  2021.09.2+382 Ghost Orchid (desktop)
 pandoc   NA

─ Packages ─────────────────────────────────────────
 package     * version date (UTC) lib source
 arrow       * 7.0.0   2022-02-10 [1] CRAN (R 4.1.1)
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.1)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
 blob          1.2.2   2021-07-23 [1] CRAN (R 4.1.0)
 brio          1.1.3   2021-11-30 [1] CRAN (R 4.1.1)
 cachem        1.0.6   2021-08-19 [1] CRAN (R 4.1.1)
 callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.0)
 cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
 crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
 DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
 dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
 desc          1.4.0   2021-09-28 [1] CRAN (R 4.1.1)
 devtools    * 2.4.3   2021-11-30 [1] CRAN (R 4.1.1)
 dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
 duckdb        0.3.2   2022-02-07 [1] CRAN (R 4.1.2)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 fansi         1.0.2   2022-01-14 [1] CRAN (R 4.1.1)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
 gbifdb      * 0.1.1   2022-02-18 [1] Github (cboettig/gbifdb@51c6d30)
 generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
 glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.1)
 lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
 magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.1)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.1.1)
 pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.1.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.1)
 praise        1.0.0   2015-08-11 [1] CRAN (R 4.1.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
 processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.0)
 ps            1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.1)
 rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.1)
 testthat    * 3.1.2   2022-01-20 [1] CRAN (R 4.1.1)
 tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.1)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 usethis     * 2.1.5   2021-12-09 [1] CRAN (R 4.1.1)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.1)

 [1] /Users/stephhazlitt/Rlibrary
 [2] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
Pakillo commented 2 years ago

Hi, here are my results

Remote options

1. arrow::open_dataset() - WORKS

library(gbifdb)
library(dplyr)
library(arrow)

## arrow::open_dataset()
bucket <- s3_bucket("gbif-open-data-eu-central-1/occurrence/2022-02-01/occurrence.parquet")
gbif_bucket <- open_dataset(bucket)

system.time({
  df_arrow <- gbif_bucket |>
    filter(phylum == "Chordata", year > 2010) |>
    count(class, year) |>
    collect()
  df_arrow
})
#>    user  system elapsed 
#> 235.360  27.849 203.743

2. gbifdb::gbif_remote(to_duckdb = FALSE) - WORKS

library(gbifdb)
library(dplyr)

gbif_arrow <- gbif_remote(bucket = "gbif-open-data-eu-central-1",
                          to_duckdb = FALSE)

system.time({
  df_gbif_arrow <- gbif_arrow |>
    filter(phylum == "Chordata", year > 2010) |> 
    count(class, year) |> 
    collect() 
  df_gbif_arrow
})
#>    user  system elapsed 
#> 235.260  26.648 181.525

3. gbifdb::gbif_remote(to_duckdb = TRUE) - CRASH

CRASHES R (due to RAM)

Pakillo commented 2 years ago

Local database

1. gbif_local(backend = "arrow") - CRASH

(which is arrow::open_dataset() + duckdb::duckdb_register_arrow)

gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                     backend = "arrow")
gbif %>%
  filter(phylum == "Chordata", year > 2010) %>%
  count(class, year) %>% 
  collect()

CRASH

2. gbif_local(backend = "duckdb") - WORKS

(which is "pure duckdb")

system.time({
  gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/", 
                     backend = "duckdb")
  gbif %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year) %>% 
    collect()
})
#>    user  system elapsed 
#>  78.476   2.166  94.556

3. arrow::open_dataset() + to_duckdb() - CRASH

system.time({
  arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>% 
  arrow::to_duckdb() %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year) %>% 
    collect()
})

CRASHES R

4. arrow::open_dataset() without to_duckdb() - WORKS

system.time({
  arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
    filter(phylum == "Chordata", year > 2010) %>%
    count(class, year) %>% 
    collect()
})
#>    user  system elapsed 
#> 273.518  15.553  40.928

Conclusions

So the problem seems to occur when converting arrow to duckdb.

(Note that in a previous example (https://github.com/ropensci/software-review/issues/492#issuecomment-1041361595) converting arrow to duckdb did work well, but I was passing the output of a query, not the entire database)
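Along those lines, a filter-first workaround might look like this (a sketch only; the local path is hypothetical and mirrors the ones above). The idea is to do the heavy lifting in arrow and only hand the already-reduced query over to duckdb, as in the earlier working example:

```r
library(dplyr)

# Sketch: push the filter into arrow first, so that only a reduced query
# (not the entire billion-row table) is handed to duckdb. Path is hypothetical.
arrow::open_dataset("/path/to/occurrence.parquet/") |>
  filter(phylum == "Chordata", year > 2010) |>
  arrow::to_duckdb() |>   # converts the filtered query, not the full table
  count(class, year) |>
  collect()
```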

I feel like local access strategies 2 (gbif_local(backend = "duckdb")) and 3 (arrow::open_dataset() + to_duckdb()) ought to be practically identical, so I'm a bit puzzled there.

From a quick look at the code, it looks like gbif_local() with backend = "duckdb" uses duckdb to read the database directly, no? So it avoids the arrow::to_duckdb() conversion (or duckdb::duckdb_register_arrow), which seems to be what is causing the problems...
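If that reading is right, the two code paths differ roughly like this (a sketch, not the package's actual internals; the paths and table names are made up):

```r
library(DBI)
library(dplyr)

con <- DBI::dbConnect(duckdb::duckdb())

# "pure duckdb" path: duckdb scans the parquet files itself
gbif_duck <- tbl(con, sql(
  "SELECT * FROM parquet_scan('/path/to/occurrence.parquet/*')"
))

# "arrow" path: arrow opens the dataset, then registers it with duckdb --
# the step that appears to blow up memory on the full-size table
ds <- arrow::open_dataset("/path/to/occurrence.parquet/")
duckdb::duckdb_register_arrow(con, "gbif_arrow_tbl", ds)
gbif_arrow <- tbl(con, "gbif_arrow_tbl")
```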

stephhazlitt commented 2 years ago

Local Options

library(gbifdb)
library(dplyr)
library(arrow)

gbif_dir <- "/Users/stephhazlitt/dev/gbif-data/occurrence.parquet/"

1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)

WORKS

# 1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(dir = gbif_dir,
           backend = "arrow") |>
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()
#> # A tibble: 143 × 3
#> # Groups:   class [15]
#>    class           year     n
#>    <chr>          <int> <dbl>
#>  1 Reptilia        2018 10837
#>  2 Actinopterygii  2019 21428
#>  3 Reptilia        2013  6236
#>  4 Actinopterygii  2014 70435
#>  5 Actinopterygii  2013 91919
#>  6 <NA>            2013    60
#>  7 Reptilia        2014  6128
#>  8 Amphibia        2017  9118
#>  9 Actinopterygii  2016 58846
#> 10 Reptilia        2017  9534
#> # … with 133 more rows

2. gbif_local(backend = "duckdb") (which is "pure duckdb")

WORKS

# 2. gbif_local(backend = "duckdb") (which is "pure duckdb")
gbif_local(dir = gbif_dir,
           backend = "duckdb") |>
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()
#> # A tibble: 143 × 3
#> # Groups:   class [15]
#>    class       year     n
#>    <chr>      <int> <dbl>
#>  1 Ascidiacea  2018   187
#>  2 Ascidiacea  2011   302
#>  3 Ascidiacea  2012   276
#>  4 Ascidiacea  2014   757
#>  5 Ascidiacea  2016   335
#>  6 Ascidiacea  2015   392
#>  7 Ascidiacea  2013   169
#>  8 Ascidiacea  2017   394
#>  9 Ascidiacea  2019   161
#> 10 Ascidiacea  2020   144
#> # … with 133 more rows

3. arrow::open_dataset() + to_duckdb()

CRASHES

# 3. arrow::open_dataset() + to_duckdb()
arrow::open_dataset(gbif_dir) |>
  to_duckdb() |> 
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()

4. arrow::open_dataset() without to_duckdb()

WORKS

# 4. arrow::open_dataset() without to_duckdb()
arrow::open_dataset(gbif_dir) |>
  filter(phylum == "Chordata", year > 2010) |>
  count(class, year) |>
  collect()
#> # A tibble: 143 × 3
#> # Groups:   class [15]
#>    class           year       n
#>    <chr>          <int>   <int>
#>  1 Aves            2015 1880023
#>  2 Reptilia        2012    5686
#>  3 Reptilia        2015    6219
#>  4 Reptilia        2013    6236
#>  5 Reptilia        2017    9534
#>  6 Mammalia        2018   45356
#>  7 Mammalia        2013   57009
#>  8 Actinopterygii  2012  105466
#>  9 Reptilia        2011    5481
#> 10 Mammalia        2019   58039
#> # … with 133 more rows
stephhazlitt commented 2 years ago

I am not sure why now, as it often just crashed outright, but when I was running option 3 (arrow::open_dataset() + to_duckdb()) my computer prompted me to force-close applications because it was running out of memory -- so I guess it is a RAM issue. There was no issue with any of the options using the very small example data set:

library(gbifdb)
library(dplyr)
library(arrow)

# 1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(dir = gbif_example_data(),
           backend = "arrow") |>
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#>   class              n
#>   <chr>          <dbl>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

# 2. gbif_local(backend = "duckdb") (which is "pure duckdb")
gbif_local(dir = gbif_example_data(),
           backend = "duckdb") |>
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#>   class              n
#>   <chr>          <dbl>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

# 3. arrow::open_dataset() + to_duckdb()
arrow::open_dataset(gbif_example_data()) |>
  to_duckdb() |> 
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#>   class              n
#>   <chr>          <dbl>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7

# 4. arrow::open_dataset() without to_duckdb()
arrow::open_dataset(gbif_example_data()) |>
  filter(kingdom == "Plantae") |>
  group_by(class) |> 
  count() |>
  collect()
#> # A tibble: 6 × 2
#> # Groups:   class [6]
#>   class              n
#>   <chr>          <int>
#> 1 Lycopodiopsida    35
#> 2 Liliopsida       461
#> 3 Magnoliopsida    317
#> 4 Polypodiopsida   135
#> 5 Pinopsida         27
#> 6 Gnetopsida         7
jhollist commented 2 years ago

@cboettig Just checking in to see where we are on this. It sounds like we are in a holding pattern until the dependency issues are resolved. It might make sense to put the review on hold until then, unless you think it won't be too much longer. Thoughts?

cboettig commented 2 years ago

@jhollist thanks for the ping!

Good question. Yeah, I can't really predict how long it might take for the to_duckdb() or duckdb_register_arrow() mechanisms to be fixed.

It sounds like we could just make gbif_local() be "pure-duckdb" and gbif_remote() be pure arrow and we would not have any issues. Once those mechanisms stabilize, we could put them back in (I think they have the potential for nicer consistency and performance once they are finally resolved, but kinda a minor thing). Does that sound reasonable to everyone?
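Something like the following, presumably (just a sketch of the proposed defaults; the bucket layout and snapshot date are copied from the examples earlier in this thread, not from the package itself):

```r
# Pure-arrow remote: open the public parquet snapshot directly with arrow
gbif_remote_sketch <- function(bucket = "gbif-open-data-us-east-1",
                               snapshot = "2022-02-01") {
  path <- paste0(bucket, "/occurrence/", snapshot, "/occurrence.parquet")
  arrow::open_dataset(arrow::s3_bucket(path))
}

# Pure-duckdb local: let duckdb scan the downloaded parquet files itself,
# skipping the arrow registration step entirely
gbif_local_sketch <- function(dir, con = DBI::dbConnect(duckdb::duckdb())) {
  dplyr::tbl(con, dplyr::sql(
    paste0("SELECT * FROM parquet_scan('", dir, "/*')")
  ))
}
```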

jhollist commented 2 years ago

That would work for me. @stephhazlitt and @Pakillo, does @cboettig's suggestion work for you?

stephhazlitt commented 2 years ago

Sounds like a very reasonable approach to me.

Pakillo commented 2 years ago

Sounds good to me. Whatever you prefer!

cboettig commented 2 years ago

Thanks! I've switched to the duckdb-only local and arrow-only remote as the defaults, https://github.com/cboettig/gbifdb/pull/6

jhollist commented 2 years ago

@cboettig Thanks for those changes. I can confirm that gbif_remote() works for me locally. I didn't test gbif_local() as I don't have 100GB free! That being said, looks like no one had issues with the pure duckdb versions.

Let me refresh my memory on next steps here, but I think we are very close!

@Pakillo and @stephhazlitt any remaining concerns?

jhollist commented 2 years ago

@cboettig we are good to go!

@Pakillo and @stephhazlitt I went ahead and checked off the two remaining items on your checklists. The subsequent discussion indicates you are both good to go with accepting gbifdb.

Congrats, Carl! Another great contribution.

jhollist commented 2 years ago

@ropensci-review-bot approve gbifdb

ropensci-review-bot commented 2 years ago

Approved! Thanks @cboettig for submitting and @Pakillo, @stephhazlitt for your reviews! :grin:

To-dos:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience, or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips; this chapter starts the third section, which covers guidance for after onboarding (with advice on releases, package marketing, and GitHub grooming). The guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer by filling in a short form.