h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Download data? #87

Closed · hadley closed this 5 years ago

hadley commented 5 years ago

I'd love to be able to get the test data that you use, as well as the data underlying the results. Are these available in the repo somewhere? I clicked around a bit, but I couldn't see anything obvious.

jangorecki commented 5 years ago

The data we use are basically the same as in the grouping benchmark from 2014. To generate the data, use the groupby-datagen.R script:

Rscript groupby-datagen.R 1e8 1e2 0 0

You might also find the repro.sh script useful.
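
For a quick local check you could generate a smaller dataset first. A minimal sketch, assuming the datagen arguments are row count, group cardinality, NA percentage, and a sort flag, and that the script writes a file following a G1_&lt;rows&gt;_&lt;groups&gt;_&lt;na&gt;_&lt;sort&gt;.csv naming pattern (both assumptions, not confirmed above):

# generate 1e7 rows, 100 groups, 0% NAs, unsorted (assumed argument order):
#   Rscript groupby-datagen.R 1e7 1e2 0 0
library(data.table)
x = fread("G1_1e7_1e2_0_0.csv")   # assumed output file name
str(x)                            # inspect the generated columns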


Note that dplyr was performing significantly better before; I reported the regression in https://github.com/tidyverse/dplyr/issues/4334. It would be great if you could raise the priority of that issue so dplyr can at least catch up with its previous speed. To preview timings over time, you can run the code below. The git commit SHA is stored in the timings logs, so it can pinpoint where in the git history the performance regression was introduced.

library(data.table)
# read the full timings log and keep first-run dplyr groupby timings
# for one machine and the 1e2-groups, 0% NA, unsorted datasets
d = fread("https://h2oai.github.io/db-benchmark/time.csv", key="batch")[solution=="dplyr" & nodename=="mr-0xc11" & task=="groupby" & run==1L & data%like%"1e2_0_0"]
# convert the batch epoch to a timestamp and keep only the columns needed for plotting
d = d[, .(batch=as.POSIXct(batch, origin="1970-01-01", tz="UTC"), data=factor(data), question=factor(question, levels=unique(question)), version, git, time_sec)]
# one panel per dataset size, one line per benchmark question
lattice::xyplot(time_sec ~ batch | data, d, type="l", groups=question,
                xlab = "benchmark run", ylab = "seconds",
                scales=list(y=list(relation="free")), auto.key=list(points=FALSE, lines=TRUE))
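
Since each row of the log carries a git SHA, the slowdown can also be narrowed down programmatically. A minimal sketch reusing d from above, flagging the largest relative jump between consecutive runs for one dataset and one question (this heuristic is my own illustration, not part of the benchmark tooling):

# pick one dataset and one question, order by benchmark run
reg = d[data == levels(data)[1L] & question == levels(question)[1L]][order(batch)]
# relative slowdown versus the previous run; the biggest jump
# brackets the commit range where the regression appeared
reg[, ratio := time_sec / shift(time_sec)]
reg[which.max(ratio), .(batch, version, git, time_sec)]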

[Plot: dplyr groupby timings (seconds) over benchmark runs, one panel per dataset, one line per question]

hadley commented 5 years ago

It would be very useful to just have a link to download the data.

jangorecki commented 5 years ago

@hadley Links to download the timings data are on the report website https://h2oai.github.io/db-benchmark/, just below the plots in the Notes section. The data the benchmarks are run on are not really feasible to host for download because of their size. As of now the 1e9-row datasets take ~200GB, and when we add 1e10 rows it will be ~2TB. Adding datasets containing NAs will increase the overall size further, and once we add joins and other tasks it will grow even more. In many cases it will be faster to just run the R script than to download the data. Generating locally also lets us scale more easily to new datasets, such as those containing NAs.
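
To see whether generating locally beats downloading for your own setup, the CSV footprint can be extrapolated from a small sample. A rough sketch, assuming the file naming above and roughly linear growth of CSV size with row count:

# after: Rscript groupby-datagen.R 1e6 1e2 0 0
bytes_1e6 = file.size("G1_1e6_1e2_0_0.csv")            # size of a 1e6-row sample
sprintf("estimated 1e9-row CSV: ~%.1f GB", bytes_1e6 * 1e3 / 1e9)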

hadley commented 5 years ago

Ah ok, that makes sense. Thanks!