h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

add arrow solution #189

Closed jangorecki closed 3 years ago

jangorecki commented 3 years ago

It seems that arrow has grown enough (or is at least close to it) to be added as a separate solution. Note that multiple solutions already in db-benchmark use arrow as in-memory data storage, but they all implement their own algorithms to compute queries on top of the arrow format. This issue is about adding arrow together with its own implementation for computing queries.

According to https://arrow.apache.org/docs/r/articles/dataset.html#querying-the-dataset, as of now some computations (like summarise) still need to be pushed to R because they are not yet implemented in arrow. This requires calling collect() before summarise().

If a reader happens to know of an open arrow issue that corresponds to this missing feature, I would appreciate it being posted here so that I can keep checking the status of this feature and, when it is ready, adapt the benchmark scripts. I browsed arrow's Jira but haven't found anything relevant.
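
For illustration, the collect()-before-summarise() workaround mentioned above could look roughly like the sketch below. This is not the actual benchmark script: the toy data frame and the column names `id1` and `v1` are assumptions made for the example. Column selection runs on the arrow Table, while grouping and aggregation happen in dplyr after collect() has pulled the data into R.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data; the column names id1 and v1 are illustrative only.
df <- data.frame(id1 = c("a", "b", "a", "b"), v1 = c(1, 2, 3, 4))

# Store the data in arrow's in-memory format.
tbl <- arrow::Table$create(df)

ans <- tbl %>%
  select(id1, v1) %>%       # evaluated against the arrow Table
  collect() %>%             # materialise the result as an R data frame
  group_by(id1) %>%         # from here on, plain dplyr does the work
  summarise(v1 = sum(v1))
```

With this shape, only the collect() call would need to move (or be dropped) once arrow can evaluate the aggregation itself.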

jangorecki commented 3 years ago

The script will likely have to be updated once https://github.com/tidyverse/dplyr/issues/5763 is resolved.

jangorecki commented 3 years ago

It also seems that the join task will have to be postponed for now due to the following error:

Error in UseMethod("inner_join") : 
  no applicable method for 'inner_join' applied to an object of class "c('Table', 'ArrowObject', 'R6')"

Similarly, if anyone has a link to an open issue covering that, please share.

jangorecki commented 3 years ago

Arrow has been added, but it still seems to fall back to the dplyr computation engine. We will keep it as is; once native Arrow implementations are ready, the results will reflect them automatically thanks to automatic upgrades. For the "join" task it does not yet even fall back to dplyr.