Repository for reproducible benchmarking of database-like operations in a single-node environment.
Benchmark report is available at h2oai.github.io/db-benchmark.
We focus mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability both in data volume and data complexity.
Contribution and feedback are very welcome!
More solutions have been proposed. Their status can be tracked in the issue tracker of our project repository using the "new solution" label.
To reproduce the full batch benchmark run:

- edit `path.env` and set the `julia` and `java` paths
- if a solution uses python, create a new `virtualenv` as `$solution/py-$solution`, for example for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install each solution, following the `$solution/setup-$solution.sh` scripts
- edit `run.conf` to define the solutions and tasks to benchmark
- generate data, for the `groupby` task use `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` to create `G1_1e7_1e2_0_0.csv`, re-save to binary format where needed (see below), create a `data` directory and keep all data files there
- edit `_control/data.csv` to define the data sizes to benchmark using the `active` flag
- start the benchmark with `./run.sh` (a sketch of this sequence follows the list)
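A minimal sketch of that sequence, assuming a single python solution (`pandas`) and the 1e7-row groupby data case; comments mark the steps that are manual edits rather than commands:

```bash
# Hedged sketch of the batch workflow described above (pandas + groupby 1e7 assumed).
virtualenv pandas/py-pandas --python=/usr/bin/python3.6   # per-solution virtualenv
# install the solution by following pandas/setup-pandas.sh (the $solution/setup-$solution.sh pattern)
Rscript _data/groupby-datagen.R 1e7 1e2 0 0               # creates G1_1e7_1e2_0_0.csv
mkdir -p data && mv G1_1e7_1e2_0_0.csv data/              # keep all data files in ./data
# edit run.conf (solutions, tasks) and _control/data.csv (active data sizes), then:
./run.sh
```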
To benchmark a single solution:

- install the solution software
  - for python solutions use a `virtualenv` for better isolation
  - for R solutions install the package into the solution's subdirectory, so that `library("dplyr", lib.loc="./dplyr/r-dplyr")` or `library("data.table", lib.loc="./datatable/r-datatable")` works
  - note that `dplyr` requires `data.table`, and similarly `pandas` requires `(py)datatable`
- generate data using the `_data/*-datagen.R` scripts, for example, `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` creates `G1_1e7_1e2_0_0.csv`, and put the data files in the `data` directory
- run the benchmark for a single solution using `./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7`
- run other data cases by passing extra parameters `--k=1e2 --na=0 --sort=0`
- use `--quiet=true` to suppress the script's output and print timings only; use `--print=question,run,time_sec` to specify the columns printed to console, or `--print=*` to print all
- use `--out=time.csv` to write timings to a file rather than the console (a combined invocation is sketched after this list)
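As an illustration, the launcher flags listed above can be combined into a single call; this is only a sketch and assumes `G1_1e7_1e2_0_0.csv` has already been generated into `data/`:

```bash
# Hedged sketch: one single-solution run combining the launcher flags listed above.
./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 \
  --k=1e2 --na=0 --sort=0 \
  --quiet=true --print=question,run,time_sec \
  --out=time.csv    # timings written to time.csv instead of the console
```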
To work with a solution interactively:

- ensure the required data files are present in the `./data` dir
- start the solution's console with the data name set in the environment: `SRC_DATANAME=G1_1e7_1e2_0_0 R`, if desired replace `R` with `python` or `julia`
- `cudf` uses `conda` instead of `virtualenv`
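A minimal sketch of starting such an interactive session; the project root directory name and the manual `fread` call are assumptions for illustration, not the benchmark's own loading code:

```bash
# Hedged sketch: interactive session with the data case name exported in the environment.
cd db-benchmark                      # project root (directory name assumed)
SRC_DATANAME=G1_1e7_1e2_0_0 R        # replace R with python or julia if desired
# inside R, the CSV could be loaded manually, e.g. (assumption, not the project's loader):
#   DT <- data.table::fread(file.path("data", paste0(Sys.getenv("SRC_DATANAME"), ".csv")))
```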
Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we set a timeout for each single benchmark script; once the timeout value is reached the script is terminated. Please check the "exceptions" label in our repository for a list of issues/defects in solutions that prevent us from providing all timings. There is also a "no documentation" label that lists issues blocked by missing documentation in the solutions we are benchmarking.