Closed jangorecki closed 3 years ago
The script will likely have to be updated once https://github.com/tidyverse/dplyr/issues/5763 is resolved.
It also seems that the join task will have to be postponed for now due to this error:
Error in UseMethod("inner_join") :
no applicable method for 'inner_join' applied to an object of class "c('Table', 'ArrowObject', 'R6')"
Similarly, if anyone has a link to an open issue covering that, please share.
Arrow has been added, but it still seems to fall back to the dplyr computation engine. We will keep it that way so that, once native Arrow implementations are ready, the results will reflect them automatically thanks to automatic upgrades. For the "join" task it does not yet even fall back to dplyr.
It seems that arrow has grown enough (or at least close to enough) to be added as a separate solution. Note that multiple solutions already in db-benchmark use arrow as in-memory data storage, but they all implement their own algorithms to compute queries on top of the arrow format. This issue is about adding arrow together with its own implementation for computing queries.
According to https://arrow.apache.org/docs/r/articles/dataset.html#querying-the-dataset, as of now it is still necessary to push some computations (like summarise) to R because they are not yet implemented in arrow. This requires calling collect() before doing summarise().

If by any chance a reader knows of an open arrow issue that corresponds to this missing feature, I would appreciate it being posted here so that I can keep checking the status of this feature and, when it is ready, adapt the benchmark scripts. I browsed arrow's Jira but haven't found anything relevant.
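
For readers unfamiliar with the workaround described above, here is a minimal sketch of calling collect() before summarise(). It is only an illustration, not the actual benchmark script: the toy data frame and the column names id1 and v1 are assumptions made for the example, and it presumes the arrow and dplyr packages are installed.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for a groupby input (not the real benchmark data).
df <- data.frame(
  id1 = c("a", "b", "a", "b"),
  v1  = c(1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# Hold the data in Arrow memory as an Arrow Table.
tbl <- arrow::Table$create(df)

ans <- tbl %>%
  select(id1, v1) %>%   # column selection is handled on the Arrow Table
  collect() %>%         # materialise into an R data frame (tibble)
  group_by(id1) %>%     # from here on, plain dplyr does the work
  summarise(v1 = sum(v1), .groups = "drop")

print(ans)
```

Everything after collect() runs in dplyr rather than in arrow, which is why timings taken this way mostly measure the dplyr engine.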