ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb
Other
43 stars 13 forks source link

safe_right_join consuming memory #95

Open lucas-jardim opened 3 years ago

lucas-jardim commented 3 years ago

I am using filter_name function but all my memory (7GB) has been consumed by dplyr::left_join in safe_right_join function. I am using the following example:

taxadb::filter_name(c("Ocotea odorifera", "Ocotea odorifere"), "gbif")

There is a way to limit the memory usage, I tried options(duckdb_memory_limit=1) supposing it is a duckdb issue, but it did not work.

Att,

cboettig commented 3 years ago

thanks for the report. Looks like I can reproduce that, will see if I can streamline memory use (we should be able to drop the use of safe_right_join() except for RSQLite backends).

I don't think the duckdb R package recognizes that option actually. I think you have to manually send the PRAGMA to duckdb over the DBI interface, like this:


db <- taxadb::td_connect()
pragma <- paste0("PRAGMA memory_limit='", 1, "GB'")
DBI::dbExecute(db, pragma)
taxadb::filter_name(c("Ocotea odorifera", "Ocotea odorifere"), "gbif")

Still may not reduce memory use as much as desired.

You might also want to try out taxalight if you mostly just need filter_name / filter_id. These are a single function, taxalight:tl, which uses LMDB which should give a much faster at lookup with almost no memory footprint (I see 422KB used):

bench::mark({
taxalight::tl(c("Ocotea odorifera", "Ocotea odorifere"), "gbif")
})
#> # A tibble: 1 × 6
#>   expression                                                                min
#>   <bch:expr>                                                           <bch:tm>
#> 1 { taxalight::tl(c("Ocotea odorifera", "Ocotea odorifere"), "gbif") }     27ms
#> # … with 4 more variables: median <bch:tm>, itr/sec <dbl>, mem_alloc <bch:byt>,
#> #   gc/sec <dbl>

Created on 2021-08-06 by the reprex package (v2.0.0)

see: https://github.com/cboettig/taxalight

cboettig commented 3 years ago

Just another follow-up. Another option is to stick with taxadb but swap out the backend database for MonetDBLite. The taxadb approach was originally optimized for MonetDBLite. duckdb is developed by the same team and supposed to be the successor, but in many ways really doesn't compare yet. You can install and set MonetDBLite as the backend by doing:

remotes::install_version("MonetDBLite", "0.6.0")
Sys.setenv("TAXADB_DRIVER"="MonetDBLite")

and then rerun you commands to import GBIF and do the local queries. This way, memory use is less than with duckdb and queries are faster, though still not comparable to taxalight

lucas-jardim commented 3 years ago

Thank you Carl, I am going to try both options you suggested. I did not know taxalight, it seems a great alternative to filter_name.