@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1010488630 time 5
Logged review for stephhazlitt (hours: 5)
@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1012556350 time 12
Couldn't find entry for pakillo in the reviews log
@jhollist this might be a small buffy bug, thanks for your patience :-) https://github.com/ropensci-org/buffy/issues/54
@jhollist could you please try again? thank you!
@ropensci-review-bot submit review #492 (comment) time 12
I'm sorry human, I don't understand that. You can see what commands I support by typing:
@ropensci-review-bot help
@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/492#issuecomment-1012556350 time 12
Logged review for Pakillo (hours: 12)
Sorry for the delay in moving things along, @maelle! Looks like I had a bug too! Works now.
@cboettig Now that I finally got caught up, I have added the "Review In, Awaiting Changes" label. Sorry for slowing things down.
@jhollist Thanks a ton! Actually it's a happy mistake, as it's the first time this new "sorry human" response has happened in this repo :grin:
Thanks all! I'm still working on revisions as noted above; I'll include a complete report & links to my PR diffs when I've ticked off everything.
Ok, I think I've hit everything now, between https://github.com/cboettig/gbifdb/pull/3 and https://github.com/cboettig/gbifdb/pull/4. Summaries are on each PR, and maybe the diffs are helpful too.
@Pakillo and @stephhazlitt just wanted to make sure you saw these. @cboettig has two PRs linked in https://github.com/ropensci/software-review/issues/492#issuecomment-1027489887 that should address the issues raised in your reviews. Could you please take a look at these and let us know if everything is good to go?
@Pakillo and @stephhazlitt pinging you both again to make sure that your review comments have been addressed.
Thanks for the (double) ping @jhollist. I have pulled the peer-review/patch2 branch as I think it has all the review-related changes. I really like the new gbif_local() convenience function @cboettig. I am not sure that the top-level package documentation made it in, so I will put a plug in for that so that ?gbifdb works for users. Great work on the package!
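(For reference, the usual way to add top-level package documentation is a roxygen2 package-doc block; a minimal sketch below, assuming the standard usethis::use_package_doc() convention rather than gbifdb's actual source:
# R/gbifdb-package.R -- generates the ?gbifdb help topic from DESCRIPTION
#' @keywords internal
"_PACKAGE"
After devtools::document(), ?gbifdb then resolves for users.)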
Hi,
Thanks @cboettig for working out all those changes. The documentation and usage look much nicer.
Unfortunately I'm still having issues trying to make the package work...
First, if I try the remote connection (gbif_remote) using duckdb (the default), my R session invariably crashes whenever I try a query, as it runs out of RAM (16 GB on my laptop).
So, I can connect successfully to the remote database:
gbif <- gbif_remote(bucket = "gbif-open-data-eu-central-1", to_duckdb = TRUE)
but when I try a query, like the one in the README or others, my R session crashes:
gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
The remote connection does work well if I choose an arrow connection, although querying the database is still a bit slow:
library(gbifdb)
library(dplyr) # optional, for dplyr-based operations
gbif <- gbif_remote(bucket = "gbif-open-data-eu-central-1",
to_duckdb = FALSE)
system.time({
ex <- gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
})
#> user system elapsed
#> 0.055 0.000 0.056
system.time({
ex %>%
collect()
})
#> user system elapsed
#> 310.606 28.141 126.655
system.time({
df <- gbif %>%
filter(species == "Abies pinsapo",
countrycode == "ES",
year > 2020) %>%
# count(year) %>%
select(species, year, locality, decimallatitude, decimallongitude)
})
#> user system elapsed
#> 0.033 0.003 0.036
system.time({
df %>%
collect()
})
#> user system elapsed
#> 328.496 115.479 426.916
system.time({
df <- gbif %>%
filter(species == "Abies pinsapo",
countrycode == "ES",
year > 2000) %>%
count(year) %>%
collect()
})
#> user system elapsed
#> 271.148 43.273 220.665
Created on 2022-02-12 by the reprex package (v2.0.1)
Querying the local database does work well, and quite fast, but only when using the duckdb backend:
library(gbifdb)
library(dplyr)
gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/",
backend = "duckdb")
system.time({
ex <- gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
ex
})
#> user system elapsed
#> 0.015 0.000 0.015
If I choose the arrow backend, my R session always crashes (again due to RAM).
Sorry for the bad news. Please let me know if I'm doing something wrong. Otherwise I hope it's not too complicated to fix.
Cheers
P.S. The pkgdown website and the README on GitHub are out of date as README.Rmd has not been knitted.
Thanks @Pakillo! Yup, looks like there are still upstream issues in arrow and duckdb being worked out -- e.g. the arrow 7.0 R client has a bug in count that causes it to read into memory. I have a handful of open tickets in arrow and they have a great and very responsive team of developers, so I'm hopeful this will improve soon. See:
Bizarrely, I don't see a duckdb version in your sessionInfo. I have an outstanding issue on duckdb regarding RAM performance of to_duckdb() as well, https://github.com/duckdb/duckdb/issues/2768, so I think we may need to wait for that one to resolve upstream as well.
It sounds like we have reasonable functionality as long as a user sticks with arrow for remote connections and duckdb for local connections then? (Ironically I've chosen the opposite setup as the default both times!) I'm still optimistic things will improve with arrow and duckdb as the upstream issues are closed.
@Pakillo I am curious what happens when you access the data directly with arrow::open_dataset()? Probably the same crash if it is RAM-related; however, I am finding that this approach is generally faster on the example dataset and the subset of the GBIF data I downloaded.
library(gbifdb)
library(dplyr)
system.time({
ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
ex
})
Thanks @cboettig. I reckon these issues stem from upstream dependencies (arrow and duckdb) - hope they are fixed soon.
bizarrely, I don't see duckdb version in your sessionInfo
That's because in that session I am choosing the arrow backend directly (to_duckdb = FALSE), so duckdb does not get loaded.
@stephhazlitt Thanks for the suggestion! Using arrow::open_dataset, as you suggest, does not crash my R session (while using gbif_local with backend = "arrow" does crash it). Why could that be @cboettig?
library(gbifdb)
library(dplyr)
system.time({
ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
ex
})
#> user system elapsed
#> 0.199 0.017 0.252
Why could that be @cboettig?
gbif_local() using the arrow backend calls arrow::open_dataset(), but then takes the additional step of registering the connection with duckdb:
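(A sketch of that step -- simplified, with an illustrative path and view name, not the package's exact source:
library(duckdb)
conn <- DBI::dbConnect(duckdb::duckdb())
ds <- arrow::open_dataset("occurrence.parquet/")  # illustrative local path
duckdb::duckdb_register_arrow(conn, "gbif", ds)   # expose the arrow Dataset as a duckdb view
gbif <- dplyr::tbl(conn, "gbif")                  # dplyr queries now run through duckdb
)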
This ought to have an effect similar to arrow::to_duckdb(), but is done via duckdb::duckdb_register_arrow instead. This more closely parallels the user experience of backend = "duckdb". In particular, it means a user can take advantage of all methods, including window functions, that are supported in duckdb, and it also makes the argument compatible with gbif_conn(backend = "duckdb"); that is, you can get back a connection object. That could be useful if you wanted to do database-like operations instead of single-table operations (e.g. join this table on another table in the same connection).
Might be good for us to report the crash upstream to duckdb. What version of duckdb do you get these crashes on?
I see... Thanks for the explanation! Then the issues seem to be with duckdb rather than arrow, right?
I'm using the latest duckdb available on CRAN (0.3.2).
This is the code that crashes my session:
library(gbifdb)
library(dplyr)
gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/",
backend = "arrow")
# This bit below crashes R:
gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
Session info:
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.1.2 (2021-11-01)
os Ubuntu 20.04.3 LTS
system x86_64, linux-gnu
ui RStudio
language en_GB:en
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/Madrid
date 2022-02-15
rstudio 1.4.1717 Juliet Rose (desktop)
pandoc 2.5 @ /usr/bin/pandoc
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
arrow 7.0.0 2022-02-10 [1] RSPM (R 4.1.0)
assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0)
aws.s3 0.3.21 2020-04-07 [1] RSPM (R 4.1.0)
aws.signature 0.6.0 2020-06-01 [1] RSPM (R 4.1.0)
base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.1.0)
bit 4.0.4 2020-08-04 [1] RSPM (R 4.1.0)
bit64 4.0.5 2020-08-30 [1] RSPM (R 4.1.0)
cli 3.1.1 2022-01-20 [1] RSPM (R 4.1.0)
codemeta 0.1.1 2021-12-22 [1] RSPM (R 4.1.0)
crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1)
curl 4.3.2 2021-06-23 [1] RSPM (R 4.1.0)
DBI 1.1.2 2021-12-20 [1] RSPM (R 4.1.0)
dbplyr 2.1.1 2021-04-06 [1] RSPM (R 4.1.0)
digest 0.6.29 2021-12-01 [1] RSPM (R 4.1.0)
dplyr * 1.0.8 2022-02-08 [1] RSPM (R 4.1.0)
duckdb 0.3.2 2022-02-07 [1] RSPM (R 4.1.0)
ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
fansi 1.0.2 2022-01-14 [1] RSPM (R 4.1.0)
gbifdb * 0.1.1 2022-02-12 [1] Github (cboettig/gbifdb@7f425f5)
generics 0.1.2 2022-01-31 [1] RSPM (R 4.1.0)
glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.2)
httr 1.4.2 2020-07-20 [1] RSPM (R 4.1.0)
jsonlite 1.7.3 2022-01-17 [1] RSPM (R 4.1.0)
lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.1.0)
magrittr 2.0.2 2022-01-26 [1] RSPM (R 4.1.0)
pillar 1.7.0 2022-02-01 [1] RSPM (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
purrr 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
R6 2.5.1 2021-08-19 [1] RSPM (R 4.1.0)
rlang 1.0.1 2022-02-03 [1] RSPM (R 4.1.0)
sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.1.0)
tibble 3.1.6 2021-11-07 [1] RSPM (R 4.1.0)
tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.1.0)
xml2 1.3.3 2021-11-30 [1] RSPM (R 4.1.0)
[1] /home/frs/R/x86_64-pc-linux-gnu-library/4.0
[2] /opt/R/4.1.2/lib/R/library
Thanks! Yes, at the moment, it does look like purely arrow-based methods are working more-or-less as expected, e.g. gbif_remote() with to_duckdb = FALSE, or a local connection using arrow without using duckdb::duckdb_register_arrow.
I think both of these duckdb issues may be related to the still-open bug in the arrow package, https://issues.apache.org/jira/browse/ARROW-14745, but honestly I'm not sure; that's above my pay grade! I will try and follow up with the duckdb team.
I do think there are a few outstanding issues in both arrow and duckdb that I'd like to see resolved to feel better about either of these approaches.
On the test:
system.time({
ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
ex
})
I think the implicit print command at the end should force evaluation, but I'm curious if you could just add an explicit call to collect() (or compute()) to confirm it works as expected without crashing?
Also, I'm curious: if you call arrow::to_duckdb() on ex, does it work or crash?
For now it seems like it might be reasonable for gbif_local(backend = "arrow") to avoid calling duckdb::duckdb_register_arrow at all. This isn't quite as tidy a solution, since it means that gbif_local() won't be just a thin wrapper around gbif_conn(), but that should be invisible to the user anyway. I can also set to_duckdb = FALSE once again as the default in gbif_remote(), so that both functions give a "pure arrow" experience. (I'll also have to add something to deal with the less-desirable print method of open_dataset(), which does not return a tbl-class object, and I can then make the duckdb dependency optional since it's not touched by default settings.)
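(A minimal sketch of that "pure arrow" local path, as I read the proposal -- gbif_local_arrow is a hypothetical name, not the final code:
gbif_local_arrow <- function(dir) {
  # skip duckdb registration entirely; the arrow Dataset already
  # supports dplyr verbs such as filter(), count(), and collect()
  arrow::open_dataset(dir)
}
)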
I will try and do some more benchmarking as well.
Thanks @cboettig
I'm curious if you could just add an explicit call to collect()
I can confirm that using arrow::open_dataset works well, also with collect():
library(gbifdb)
library(dplyr)
arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year) %>%
collect()
#> # A tibble: 163 × 3
#> # Groups: class [15]
#> class year n
#> <chr> <int> <int>
#> 1 Aves 2015 71131647
#> 2 Reptilia 2012 57146
#> 3 Reptilia 2015 58420
#> 4 Reptilia 2013 91989
#> 5 Reptilia 2017 64181
#> 6 Mammalia 2018 450071
#> 7 Mammalia 2013 492888
#> 8 Actinopterygii 2012 1487182
#> 9 Reptilia 2011 48347
#> 10 Mammalia 2019 452761
#> # … with 153 more rows
Also, I'm curious if you call arrow::to_duckdb() on ex, does it work or crash?
It works!
library(gbifdb)
library(dplyr)
ex <- arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
ex.duckdb <- arrow::to_duckdb(ex)
ex.duckdb
#> # Source: table<arrow_001> [?? x 3]
#> # Database: duckdb_connection
#> # Groups: class
#> class year n
#> <chr> <int> <dbl>
#> 1 Aves 2015 71131647
#> 2 Reptilia 2012 57146
#> 3 Reptilia 2015 58420
#> 4 Reptilia 2013 91989
#> 5 Reptilia 2017 64181
#> 6 Mammalia 2018 450071
#> 7 Mammalia 2013 492888
#> 8 Actinopterygii 2012 1487182
#> 9 Reptilia 2011 48347
#> 10 Mammalia 2019 452761
#> # … with more rows
I think both of these duckdb issues may be related to the still-open bug in the arrow package, https://issues.apache.org/jira/browse/ARROW-14745 , but honestly not sure, that's above my pay grade! I will try and follow up with the duckdb team.
That issue is closed (though the fix for it is not in the 7.0 release!). It's possible that the crash is being caused by the behavior that's still in 7.0. Though I will note that that change is only for going from duckdb into arrow (e.g. via to_arrow()).
I haven't looked at the code in GBIF in detail, but at first glance I did not see code that does that. Am I missing something? Or is it that after being read in with arrow and then registered with duckdb, the querying is happening all inside of duckdb?
@jonkeane thanks so much for joining us on this! Yes, I meant to test on the nightly; as you see, @Pakillo is testing against the current 7.0 release. gbifdb does not use to_arrow() anywhere though. gbifdb uses arrow::to_duckdb (optionally) on the remote data source, which seems to result in crashes (though it seems okay if to_duckdb() is used on local parquet instead of over the S3 interface).
gbifdb also uses duckdb::duckdb_register_arrow on an arrow::open_dataset() called on the local parquet files, which seems to cause trouble as well. This seems to be nearly the same thing as calling arrow::to_duckdb() (from a quick skim of the source code, both create a "View" in duckdb of the parquet tables, right? so I think either case hands off to duckdb for the actual queries thereafter), so I'm not sure why one (arrow::to_duckdb) works while the other fails. (Though if that's true, it's not clear to me how arrow::to_duckdb() is working on remote sources in the first place?)
Just wanted to let you all know that I haven't been ignoring this, just didn't have anything useful to add. At this point it sounds like we are making progress on tracking down the source of the crashes, but still trying to figure out the solution. If there is anything you need from me, just let me know and I will keep an eye on the conversation.
gbifdb uses arrow::to_duckdb (optionally) on the remote data source, which seems to result in crashes (though it seems okay if to_duckdb() is used on local parquet instead of over the S3 interface).
Interesting. To be totally honest, I don't think I myself have tried to take an S3-backed dataset and run it through to_duckdb(), so it is totally possible that there's an issue there. If you have a reproducer, would you mind making a Jira for us?
gbifdb also uses duckdb::duckdb_register_arrow on an arrow::open_dataset() called on the local parquet files, which seems to cause trouble as well. This seems to be nearly the same thing as calling arrow::to_duckdb() (a quick skim of source code in both cases is that both are creating a "View" in duckdb of the parquet tables, right? so I think either case is handing off to duckdb for the actual queries there-after), so I'm not sure why the one (arrow::to_duckdb) works while the other fails. (Though if that's true, it's not clear to me how arrow::to_duckdb() is working on remote sources in the first place?)
You're right that arrow::to_duckdb() is mostly only calling duckdb::duckdb_register_arrow(), which registers the View (IIRC in the DuckDB world it's actually an external table or something like that, but those are very similar to views!). We also do a bit of work to make sure there is a duckdb connection already (and create one if not) + wrap the DuckDB reference in tbl() so that it can be used in dplyr queries + set up a finalizer so that we clean up the DuckDB reference when the tbl is gc()ed. Any chance you have a stack trace from where the crash is happening?
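(Paraphrasing that description as a rough sketch -- simplified relative to the actual arrow source, and to_duckdb_sketch is a made-up name:
to_duckdb_sketch <- function(ds, con = DBI::dbConnect(duckdb::duckdb()),
                             table_name = "arrow_tbl") {
  # register the arrow dataset as a duckdb view, then hand back a dplyr tbl;
  # the real function also sets a finalizer to unregister the view on gc()
  duckdb::duckdb_register_arrow(con, table_name, ds)
  dplyr::tbl(con, table_name)
}
)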
Hi,
I'm happy to help fix this, but I don't know what is needed exactly (beyond what I've shared above). How can I generate a stack trace of those problems? Or can you do that @cboettig?
Also, I wonder: am I the only one getting these crashes? Why? Do you have much larger RAM available?
@Pakillo
gbif <- gbif_remote(to_duckdb = TRUE)
gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year)
Does not run on my machine either. I never get to a crash; it just hangs. I let it run for several hours last night just to be sure it wasn't simply taking a long time.
@cboettig Do you have a plan for solving these or do we want to put the submission on hold until some of the dependency issues clear up?
Hi all,
I really appreciate all the help and testing! Just wanted to make sure I'm keeping straight what is and isn't crashing for everyone.
gbifdb has two ways to access data without download:
1. gbif_remote(to_duckdb = FALSE)
2. gbif_remote(to_duckdb = TRUE)
Am I correct that everyone sees issues with the second case but not the first on the current releases, arrow 7.0 and duckdb 0.3.2?
gbifdb has two ways to do local parquet access currently (1-2 below), but we're also now discussing 2 additional strategies (3-4) by which arrow could access the local dataset:
1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
2. gbif_local(backend = "duckdb") (which is "pure duckdb")
3. arrow::open_dataset() + to_duckdb()
4. arrow::open_dataset() without to_duckdb()
Do I have it correct that 1 & 3 crash for everyone, but 2 & 4 work for everyone? (I'm double-checking these on my end too with dev versions of the packages.)
I feel like local access strategies 2 & 3 ought to be practically identical, so I'm a bit puzzled there.
Practically speaking, if my summary is correct, I can make it so that for gbif_remote(), option 1 is the default, and for gbif_local(), option 2 is the default. (This makes the remote option "pure arrow" and the local option "pure duckdb", with the minor downside that "pure arrow" doesn't return a tibble or support windowing, but we can live with that until to_duckdb is fixed.)
I don't want to over-impose on our rOpenSci reviewers & editors here, but really appreciate everyone's perspectives and input in weighing these alternatives.
For the cases that crash, I definitely want to make sure we can get detailed, reproducible bug reports to arrow JIRA and duckdb GitHub trackers.
@cboettig I am working through the various strategies today and will report back with details.
FWIW, I just noticed gbif_download() wasn't syncing correctly, so I've updated that if you want to attempt downloading the whole thing again: https://github.com/cboettig/gbifdb/tree/patch-download
I was not experiencing the same crashing as others; however, as a confirmation, I thought I would try the various paths out again. I installed the patch-download version before running these.
library(gbifdb)
library(dplyr)
library(arrow)
> ## arrow::open_dataset()
> bucket <- s3_bucket("gbif-open-data-us-east-1/occurrence/2022-02-01/occurrence.parquet")
> gbif_bucket <- open_dataset(bucket)
>
> system.time({
+ df_arrow <- gbif_bucket |>
+ filter(phylum == "Chordata", year > 2010) |>
+ count(class, year) |>
+ collect()
+ df_arrow
+ })
user system elapsed
182.430 35.228 1008.520
> ## gbifdb::gbif_remote w/ arrow
> gbif_arrow <- gbif_remote(bucket = "gbif-open-data-us-east-1",
+ to_duckdb = FALSE)
>
> system.time({
+ df_gbif_arrow <- gbif_arrow |>
+ filter(phylum == "Chordata", year > 2010) |>
+ count(class, year) |>
+ collect()
+ df_gbif_arrow
+ })
* Closing connection 0
user system elapsed
184.142 36.059 1055.453
> gbif_duckdb <- gbif_remote(bucket = "gbif-open-data-us-east-1",
+ to_duckdb = TRUE)
* Closing connection 0
>
> system.time({
+ df_gbif_duckdb <- gbif_duckdb |>
+ filter(phylum == "Chordata", year > 2010) |>
+ count(class, year) |>
+ collect()
+ df_gbif_duckdb
+ })
user system elapsed
283.422 82.892 2137.000
> session_info()
─ Session info ───────────────────────────────────
setting value
version R version 4.1.2 (2021-11-01)
os macOS Big Sur 11.6
system aarch64, darwin20
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Vancouver
date 2022-02-18
rstudio 2021.09.2+382 Ghost Orchid (desktop)
pandoc NA
─ Packages ─────────────────────────────────────────
package * version date (UTC) lib source
arrow * 7.0.0 2022-02-10 [1] CRAN (R 4.1.1)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.1)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.0)
brio 1.1.3 2021-11-30 [1] CRAN (R 4.1.1)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2)
crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.2)
DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.2)
dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
desc 1.4.0 2021-09-28 [1] CRAN (R 4.1.1)
devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.1.2)
duckdb 0.3.2 2022-02-07 [1] CRAN (R 4.1.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.1)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
gbifdb * 0.1.1 2022-02-18 [1] Github (cboettig/gbifdb@51c6d30)
generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.1)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.1)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1)
pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.1)
praise 1.0.0 2015-08-11 [1] CRAN (R 4.1.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.1)
rlang 1.0.1 2022-02-03 [1] CRAN (R 4.1.2)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1)
testthat * 3.1.2 2022-01-20 [1] CRAN (R 4.1.1)
tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
usethis * 2.1.5 2021-12-09 [1] CRAN (R 4.1.1)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
[1] /Users/stephhazlitt/Rlibrary
[2] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
Hi, here are my results:
library(gbifdb)
library(dplyr)
library(arrow)
## arrow::open_dataset()
bucket <- s3_bucket("gbif-open-data-eu-central-1/occurrence/2022-02-01/occurrence.parquet")
gbif_bucket <- open_dataset(bucket)
system.time({
df_arrow <- gbif_bucket |>
filter(phylum == "Chordata", year > 2010) |>
count(class, year) |>
collect()
df_arrow
})
#> user system elapsed
#> 235.360 27.849 203.743
library(gbifdb)
library(dplyr)
gbif_arrow <- gbif_remote(bucket = "gbif-open-data-eu-central-1",
to_duckdb = FALSE)
system.time({
df_gbif_arrow <- gbif_arrow |>
filter(phylum == "Chordata", year > 2010) |>
count(class, year) |>
collect()
df_gbif_arrow
})
#> user system elapsed
#> 235.260 26.648 181.525
gbif_remote(to_duckdb = TRUE): CRASHES R (due to RAM)
1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow):
gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/",
backend = "arrow")
gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year) %>%
collect()
CRASH
2. gbif_local(backend = "duckdb") (which is "pure duckdb"):
system.time({
gbif <- gbif_local(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/",
backend = "duckdb")
gbif %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year) %>%
collect()
})
#> user system elapsed
#> 78.476 2.166 94.556
3. arrow::open_dataset() + to_duckdb():
system.time({
arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
arrow::to_duckdb() %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year) %>%
collect()
})
CRASHES R
4. arrow::open_dataset() without to_duckdb():
system.time({
arrow::open_dataset("/media/frs/Elements/gbifdb/occurrence.parquet/") %>%
filter(phylum == "Chordata", year > 2010) %>%
count(class, year) %>%
collect()
})
#> user system elapsed
#> 273.518 15.553 40.928
gbif_remote works well when to_duckdb = FALSE.
gbif_local works well when backend = "duckdb".
So, the problem seems to arise when converting arrow to duckdb.
(Note that in a previous example (https://github.com/ropensci/software-review/issues/492#issuecomment-1041361595) converting arrow to duckdb did work well, but I was passing the output of a query, not the entire database.)
I feel like local access strategies 2 (gbif_local(backend = "duckdb")) & 3 (arrow::open_dataset() + to_duckdb()) ought to be practically identical, so I'm a bit puzzled there.
From a quick look at the code, it looks like gbif_local with backend = "duckdb" uses duckdb to read the database directly, no? So it does avoid the arrow::to_duckdb() conversion (or duckdb::duckdb_register_arrow), which seems to be causing the problems...
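(For comparison, a minimal sketch of such a direct duckdb read -- path and view name are illustrative, and gbifdb's actual SQL may differ:
con <- DBI::dbConnect(duckdb::duckdb())
# expose the parquet files as a view so queries run entirely inside duckdb
DBI::dbExecute(con,
  "CREATE VIEW gbif AS SELECT * FROM parquet_scan('occurrence.parquet/*')")
gbif <- dplyr::tbl(con, "gbif")
)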
library(gbifdb)
library(dplyr)
library(arrow)
gbif_dir <- "/Users/stephhazlitt/dev/gbif-data/occurrence.parquet/"
1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
WORKS
# 1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(dir = gbif_dir,
backend = "arrow") |>
filter(phylum == "Chordata", year > 2010) |>
count(class, year) |>
collect()
#> # A tibble: 143 × 3
#> # Groups: class [15]
#> class year n
#> <chr> <int> <dbl>
#> 1 Reptilia 2018 10837
#> 2 Actinopterygii 2019 21428
#> 3 Reptilia 2013 6236
#> 4 Actinopterygii 2014 70435
#> 5 Actinopterygii 2013 91919
#> 6 <NA> 2013 60
#> 7 Reptilia 2014 6128
#> 8 Amphibia 2017 9118
#> 9 Actinopterygii 2016 58846
#> 10 Reptilia 2017 9534
#> # … with 133 more rows
2. gbif_local(backend = "duckdb") (which is "pure duckdb")
WORKS
# 2. gbif_local(backend = "duckdb") (which is "pure duckdb")
gbif_local(dir = gbif_dir,
backend = "duckdb") |>
filter(phylum == "Chordata", year > 2010) |>
count(class, year) |>
collect()
#> # A tibble: 143 × 3
#> # Groups: class [15]
#> class year n
#> <chr> <int> <dbl>
#> 1 Ascidiacea 2018 187
#> 2 Ascidiacea 2011 302
#> 3 Ascidiacea 2012 276
#> 4 Ascidiacea 2014 757
#> 5 Ascidiacea 2016 335
#> 6 Ascidiacea 2015 392
#> 7 Ascidiacea 2013 169
#> 8 Ascidiacea 2017 394
#> 9 Ascidiacea 2019 161
#> 10 Ascidiacea 2020 144
#> # … with 133 more rows
3. arrow::open_dataset() + to_duckdb()
CRASHES
# 3. arrow::open_dataset() + to_duckdb()
arrow::open_dataset(gbif_dir) |>
to_duckdb() |>
filter(phylum == "Chordata", year > 2010) |>
count(class, year) |>
collect()
4. arrow::open_dataset() without to_duckdb()
WORKS
# 4. arrow::open_dataset() without to_duckdb()
arrow::open_dataset(gbif_dir) |>
filter(phylum == "Chordata", year > 2010) |>
count(class, year) |>
collect()
#> # A tibble: 143 × 3
#> # Groups: class [15]
#> class year n
#> <chr> <int> <int>
#> 1 Aves 2015 1880023
#> 2 Reptilia 2012 5686
#> 3 Reptilia 2015 6219
#> 4 Reptilia 2013 6236
#> 5 Reptilia 2017 9534
#> 6 Mammalia 2018 45356
#> 7 Mammalia 2013 57009
#> 8 Actinopterygii 2012 105466
#> 9 Reptilia 2011 5481
#> 10 Mammalia 2019 58039
#> # … with 133 more rows
I am not sure why now, as it often just crashed outright, but when I was running 3. arrow::open_dataset() + to_duckdb() my computer prompted me to force-close applications as it was running out of memory -- so I guess it is a RAM issue. No issues with any of the options on the very small example data set:
library(gbifdb)
library(dplyr)
library(arrow)
# 1. gbif_local(backend = "arrow") (which is arrow::open_dataset() + duckdb::duckdb_register_arrow)
gbif_local(dir = gbif_example_data(),
backend = "arrow") |>
filter(kingdom == "Plantae") |>
group_by(class) |>
count() |>
collect()
#> # A tibble: 6 × 2
#> class n
#> <chr> <dbl>
#> 1 Lycopodiopsida 35
#> 2 Liliopsida 461
#> 3 Magnoliopsida 317
#> 4 Polypodiopsida 135
#> 5 Pinopsida 27
#> 6 Gnetopsida 7
# 2. gbif_local(backend = "duckdb") (which is "pure duckdb")
gbif_local(dir = gbif_example_data(),
backend = "duckdb") |>
filter(kingdom == "Plantae") |>
group_by(class) |>
count() |>
collect()
#> # A tibble: 6 × 2
#> class n
#> <chr> <dbl>
#> 1 Lycopodiopsida 35
#> 2 Liliopsida 461
#> 3 Magnoliopsida 317
#> 4 Polypodiopsida 135
#> 5 Pinopsida 27
#> 6 Gnetopsida 7
# 3. arrow::open_dataset() + to_duckdb()
arrow::open_dataset(gbif_example_data()) |>
to_duckdb() |>
filter(kingdom == "Plantae") |>
group_by(class) |>
count() |>
collect()
#> # A tibble: 6 × 2
#> class n
#> <chr> <dbl>
#> 1 Lycopodiopsida 35
#> 2 Liliopsida 461
#> 3 Magnoliopsida 317
#> 4 Polypodiopsida 135
#> 5 Pinopsida 27
#> 6 Gnetopsida 7
# 4. arrow::open_dataset() without to_duckdb()
arrow::open_dataset(gbif_example_data()) |>
filter(kingdom == "Plantae") |>
group_by(class) |>
count() |>
collect()
#> # A tibble: 6 × 2
#> # Groups: class [6]
#> class n
#> <chr> <int>
#> 1 Lycopodiopsida 35
#> 2 Liliopsida 461
#> 3 Magnoliopsida 317
#> 4 Polypodiopsida 135
#> 5 Pinopsida 27
#> 6 Gnetopsida 7
@cboettig Just checking in to see where we are on this. It sounds like we are in a holding pattern until the dependency issues are resolved. It might make sense to put the review on hold until then, unless you think it won't be too much longer. Thoughts?
@jhollist thanks for the ping!
Good question; yeah, I can't really predict how long it might take to see the to_duckdb() or duckdb_register_arrow() mechanisms fixed.
It sounds like we could just make gbif_local() be "pure duckdb" and gbif_remote() be "pure arrow" and we would not have any issues. Once those mechanisms stabilize, we could put them back in (I think they have the potential for nicer consistency and performance once they are finally resolved, but that's kind of a minor thing). Does that sound reasonable to everyone?
That would work for me. @stephhazlitt and @Pakillo, does @cboettig's suggestion work for you?
Very reasonable approach to me.
Sounds good to me. Whatever you prefer!
Thanks! I've switched to the duckdb-only local and arrow-only remote as the defaults, https://github.com/cboettig/gbifdb/pull/6
@cboettig Thanks for those changes. I can confirm that gbif_remote() works for me locally. I didn't test gbif_local() as I don't have 100GB free! That being said, looks like no one had issues with the pure duckdb versions.
Let me refresh my memory on next steps here, but I think we are very close!
@Pakillo and @stephhazlitt any remaining concerns?
@cboettig we are good to go!
@Pakillo and @stephhazlitt I went ahead and checked off the two remaining items on your checklists. The subsequent discussion indicates you are both good to go with accepting gbifdb.
Congrats, Carl! Another great contribution.
@ropensci-review-bot approve gbifdb
Approved! Thanks @cboettig for submitting and @Pakillo, @stephhazlitt for your reviews! :grin:
To-dos:
- Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings > Transfer Ownership". After transfer, write a comment @ropensci-review-bot finalize transfer of <package-name>, where <package-name> is the repo/package name. This will give you admin access back.
- If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
  - replace the whole current pkgdown website with a redirecting page,
  - replace your package docs URL with https://docs.ropensci.org/package_name, and
  - in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
- Add software metadata by running codemetar::write_codemeta() in the root of your package.
- Your package is now installable via install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).
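For illustration, a hypothetical Authors@R entry (names and comment are placeholders, not gbifdb's actual DESCRIPTION):
Authors@R: c(
    person("Carl", "Boettiger", role = c("aut", "cre")),
    person("Jane", "Reviewer", role = "rev",
           comment = "Jane reviewed the package for rOpenSci"))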
Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience, or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.
We maintain an online book with our best practices and tips; this chapter starts the third section, which gives guidance for after onboarding (with advice on releases, package marketing, and GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.
Last but not least, you can volunteer as a reviewer by filling out a short form.
Date accepted: 2022-03-22
Submitting Author Name: Carl Boettiger
Submitting Author Github Handle: @cboettig
Repository: https://github.com/cboettig/gbifdb
Version submitted: 0.1.0
Submission type: Standard
Editor: @jhollist
Reviewers: @Pakillo, @stephhazlitt
Due date for @Pakillo: 2022-01-10
Due date for @stephhazlitt: 2022-01-10
Archive: TBD
Version accepted: TBD
Language: en
Scope
Please indicate which category or categories from our package fit policies this package falls under (please check an appropriate box below; if you are unsure, we suggest you make a pre-submission inquiry):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
This package is similar in scope to rgbif, but allows direct access to the full occurrence table using dplyr functions with or without downloading.
Ecologists etc. using / analyzing GBIF data. In particular, anyone attempting analyses which need the whole GBIF data corpus, like the example shown in the README, cannot easily do so with the standard API / rgbif approach.
As noted above, this is similar in scope to rgbif, but more general and faster.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
[x] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)).
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution.*)

Code of conduct