mattdowle closed this issue 3 years ago
Issue https://github.com/Rdatatable/data.table/issues/2956 can also be confirmed as resolved when running the 2E9 benchmark.
Blocked by https://github.com/tidyverse/dplyr/issues/4334 as of now.
https://github.com/tidyverse/dplyr/issues/4334 has recently been resolved; once the fix lands on CRAN we should be good to proceed with this issue.
We can wait for dplyr 1.0 to be released, as it seems to be the next major version. pandas also reached version 1.0 recently.
We need to postpone this until dplyr 1.1.0. Performance polishing was shifted to the 1.1.0 release, and dplyr 1.0 is expected to be slower.
It is now blocked on https://github.com/tidyverse/dplyr/issues/5291
The same machine as in 2014 was used, with 244GB of memory, running recent stable versions as of today.
Minor changes to 2014's script:
Results:
data.table got the regression fixed in https://github.com/Rdatatable/data.table/pull/4297, so I reran the tests defined in this issue. The dplyr version hasn't changed since my last attempt, so I skipped rerunning it. pandas was upgraded to 1.1.5.
Results:
db-bench runs on a dedicated machine (provided by H2O) which has 125GB of RAM, so 2E9 won't fit in RAM (the data itself takes 100GB, which leaves too little working memory). This machine has a fast, large disk though, and testing out-of-RAM processing is a much higher priority than testing with more RAM; i.e. adding a 500GB (1E10) test (https://github.com/h2oai/db-benchmark/issues/39) on the same 125GB RAM db-bench machine, where spark and pydatatable will work but the other products will fail. However, for completeness, it would still be nice to know whether pandas now works on 2E9 on a node with 250GB RAM (it didn't 4 years ago, but data.table did). This issue was moved here from https://github.com/Rdatatable/data.table/issues/823
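For reference, the sizing arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: it assumes the footprint scales linearly with row count from the 100GB quoted for 2E9 rows, and the `working_factor` parameter (how much total memory a product needs relative to the raw data) is a hypothetical knob rather than a measured value. The function names are made up for this sketch as well.

```python
# Figures quoted in this thread: 2E9 rows take roughly 100GB in memory.
ROWS_2E9 = 2_000_000_000
SIZE_2E9_GB = 100


def estimated_size_gb(rows):
    """Scale the 2E9-row footprint linearly to another row count (assumption)."""
    return SIZE_2E9_GB * rows / ROWS_2E9


def fits_in_ram(rows, ram_gb, working_factor=2.0):
    """Return True if data plus working memory fits in ram_gb.

    working_factor is hypothetical: 2.0 means a product needs about as much
    working memory as the data itself. Real requirements vary per product.
    """
    return estimated_size_gb(rows) * working_factor <= ram_gb


if __name__ == "__main__":
    # Machines mentioned in the thread: 125GB (db-bench) and 244GB (2014 machine).
    for rows in (2_000_000_000, 10_000_000_000):
        for ram_gb in (125, 244):
            verdict = "fits" if fits_in_ram(rows, ram_gb) else "does not fit"
            print(f"{rows:.0e} rows (~{estimated_size_gb(rows):.0f}GB) "
                  f"on {ram_gb}GB RAM: {verdict}")
```

Under these assumptions, 2E9 is an in-RAM test only on the larger machine, and 1E10 is an out-of-RAM test on both, which matches the reasoning for prioritising the out-of-RAM case on the 125GB box.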