Run 2E9 rows in-ram on EC2

h2oai / db-benchmark

reproducible benchmark of database-like ops

https://h2oai.github.io/db-benchmark

Mozilla Public License 2.0


Run 2E9 rows in-ram on EC2 #71

Closed by mattdowle 3 years ago

mattdowle commented 5 years ago

db-bench runs on a dedicated machine (provided by H2O) which has 125GB of RAM, so 2E9 rows won't fit in RAM: the data itself takes 100GB, which leaves too little working memory. This machine does have a fast, large disk, though, and testing out-of-RAM is a much higher priority than testing with more RAM; i.e. adding a 500GB (1E10) test (https://github.com/h2oai/db-benchmark/issues/39) on the same 125GB RAM db-bench machine, where spark and pydatatable will work but the other products will fail. However, for completeness, it would still be nice to know whether pandas now works on 2E9 on a node with 250GB RAM (it didn't 4 years ago, but data.table did). This issue was moved here from https://github.com/Rdatatable/data.table/issues/823
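The sizing argument above can be checked with a few lines of arithmetic. This is only a rough sketch: it takes the 100GB figure for 2E9 rows from the comment, treats 1GB as 1e9 bytes (an assumption), and assumes the per-row size stays constant as the row count grows. The function names are hypothetical and not part of db-benchmark.

```python
# Back-of-the-envelope sizing based on the figures quoted in the comment.
# Assumptions: 1 GB = 1e9 bytes, and bytes per row is constant across sizes.
GB = 1e9

ROWS_2E9 = 2e9
DATA_SIZE_2E9_GB = 100  # "the data itself takes 100GB"


def bytes_per_row(rows, size_gb):
    """Average in-memory bytes per row for a dataset of the given size."""
    return size_gb * GB / rows


def dataset_size_gb(rows, row_bytes):
    """Estimated dataset size in GB for a given row count."""
    return rows * row_bytes / GB


def working_memory_gb(ram_gb, data_gb):
    """RAM left over for intermediate results once the data is loaded."""
    return ram_gb - data_gb


row_bytes = bytes_per_row(ROWS_2E9, DATA_SIZE_2E9_GB)
print(f"bytes per row: {row_bytes:.0f}")

# Compare the db-bench machine (125GB) with the proposed 250GB node.
for ram_gb in (125, 250):
    left = working_memory_gb(ram_gb, DATA_SIZE_2E9_GB)
    print(f"{ram_gb}GB RAM leaves {left}GB of working memory for 2E9 rows")

# The 1E10 case referenced in issue #39.
print(f"1E10 rows: {dataset_size_gb(1e10, row_bytes):.0f}GB")
```

Under these assumptions the 1E10 case comes out at the 500GB quoted above, and the 125GB machine is left with only 25GB for any intermediate copies a grouping query makes.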

jangorecki commented 5 years ago

This issue https://github.com/Rdatatable/data.table/issues/2956 can also be confirmed as resolved when running the 2E9 benchmark.

jangorecki commented 5 years ago

blocked by https://github.com/tidyverse/dplyr/issues/4334 as of now

jangorecki commented 4 years ago

https://github.com/tidyverse/dplyr/issues/4334 has recently been resolved; once it lands on CRAN we should be good to proceed with this issue.

jangorecki commented 4 years ago

We can wait for dplyr 1.0 to be released, as it seems to be the next major version. pandas also reached version 1.0 recently.

jangorecki commented 4 years ago

Need to postpone that to dplyr 1.1.0. Performance polishing was shifted to the 1.1.0 release, and dplyr 1.0 is expected to be slower.

jangorecki commented 4 years ago

It is now blocked on https://github.com/tidyverse/dplyr/issues/5291

jangorecki commented 3 years ago

The same machine as in 2014 was used, with 244GB of memory, along with the most recent stable versions as of today:

data.table 1.13.2, R 4.0.3
dplyr 1.0.2, R 4.0.3
pandas 1.1.4, python 3.6

Minor changes to 2014's script:

data.table: added setDTthreads(0L)
dplyr: updated the summarise call in q4 and q5; updated group_by in all questions to use .drop=TRUE

Results:

data.table hit an internal error during the first query: https://github.com/Rdatatable/data.table/issues/4818
dplyr hit an internal error during the first query: https://github.com/tidyverse/dplyr/issues/5291
The pandas Python process was killed while creating the 2e9 dataset, so it couldn't even attempt the first query.

jangorecki commented 3 years ago

data.table got the regression fixed in https://github.com/Rdatatable/data.table/pull/4297, so I retried the tests defined in this issue. The dplyr version hasn't changed since my last attempt, so I skipped retrying it. pandas was upgraded to 1.1.5.

Results:

data.table now finishes the benchmark script successfully; timings pasted here
pandas 1.1.5 is still being killed, same as on 1.1.4