h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Make command-line script for running the benchmark easily #130

Closed · st-pasha closed 4 years ago

st-pasha commented 4 years ago

Currently the steps to reproduce the benchmark are unnecessarily complicated: they require setting up environment variables and virtual environments, running preparatory steps, editing several files (which are checked into GitHub, so they really shouldn't be modified), and finally running a shell script. Even all of this may not be enough: after doing everything above, the result comes out as

Log output:

```
sh: free: command not found
Error in `[.data.table`(data.table::fread("free -h | grep Swap", header = FALSE),  :
  Item 1 of j is 1 which is outside the column number range [1,ncol=0]
Calls: [ -> [.data.table
In addition: Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//RtmpFj596I/file38921ad8fd8d' has size 0. Returning a NULL data.table.
Execution halted
# Benchmark run 1580153381 started
# Running benchmark scripts launcher
starting: pydatatable groupby G1_1e7_1e2_0_0
finished: pydatatable groupby G1_1e7_1e2_0_0
starting: pydatatable groupby G1_1e7_1e1_0_0
finished: pydatatable groupby G1_1e7_1e1_0_0
starting: pydatatable groupby G1_1e7_2e0_0_0
finished: pydatatable groupby G1_1e7_2e0_0_0
starting: pydatatable groupby G1_1e7_1e2_0_1
finished: pydatatable groupby G1_1e7_1e2_0_1
starting: pydatatable groupby G1_1e8_1e2_0_0
finished: pydatatable groupby G1_1e8_1e2_0_0
starting: pydatatable groupby G1_1e8_1e1_0_0
finished: pydatatable groupby G1_1e8_1e1_0_0
starting: pydatatable groupby G1_1e8_2e0_0_0
finished: pydatatable groupby G1_1e8_2e0_0_0
starting: pydatatable groupby G1_1e8_1e2_0_1
finished: pydatatable groupby G1_1e8_1e2_0_1
starting: pydatatable groupby G1_1e9_1e2_0_0
finished: pydatatable groupby G1_1e9_1e2_0_0
starting: pydatatable groupby G1_1e9_1e1_0_0
finished: pydatatable groupby G1_1e9_1e1_0_0
starting: pydatatable groupby G1_1e9_2e0_0_0
finished: pydatatable groupby G1_1e9_2e0_0_0
starting: pydatatable groupby G1_1e9_1e2_0_1
finished: pydatatable groupby G1_1e9_1e2_0_1
# Benchmark run 1580153381 has been completed in 1s
```

which is difficult to understand (in particular, it is unclear where the results of the benchmark run end up).

The "Single solution benchmark interactively" section has fewer steps, but it ends with the cryptic remark "... run lines of script interactively", which again is unclear.


A more user-friendly approach would be to develop a simple wrapper utility script (perhaps written in Python or R) with roughly the following functionality:

```
run --solution SOLUTION --task TASK --question QUESTION --datasize N --out FILE

Arguments:
  --solution   Which solution to run, one of [...], or "*" to run them all.
               This option can be specified multiple times.
  --task       Either "groupby" (default) or "join"
  --question   Which question(s) to run, for example "1", "3:4" or "2,4,5"
  --datasize   Number of rows in the dataset, e.g. "1e7"
  --out        Save the results into the specified csv file. If the option is not
               given, the results will be printed to the console in human-readable form.
```

Notably, this script should not handle environments or software upgrades, post-process the results, collect them into a "historic record" file, or publish to GitHub pages, etc. All of these tasks can be delegated to a higher-level script that invokes the user-level run script with appropriate arguments and then post-processes its results.
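
For illustration, a minimal sketch of what the option parsing of such a wrapper could look like in Python (nothing below exists in the repo; the names and defaults are just the proposal above):

```
# hypothetical sketch of the proposed `run` wrapper -- not part of the repo
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Run a single db-benchmark solution and report its timings.")
    parser.add_argument("--solution", action="append", required=True,
                        help="solution to run, e.g. 'pydatatable'; repeatable, or '*' for all")
    parser.add_argument("--task", default="groupby", choices=["groupby", "join"],
                        help="benchmark task to run (default: groupby)")
    parser.add_argument("--question", default=None,
                        help="question(s) to run, e.g. '1', '3:4' or '2,4,5' (default: all)")
    parser.add_argument("--datasize", default="1e7",
                        help="number of rows in the dataset, e.g. '1e7'")
    parser.add_argument("--out", default=None,
                        help="write results to this csv file; otherwise print to the console")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)  # the real script would dispatch to the solution's benchmark script here
```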

jangorecki commented 4 years ago

Good idea. Although it might be quite complicated to isolate as much as you are asking for.

st-pasha commented 4 years ago

> Running multiple solutions at once would be problematic because it has to activate a different virtual env for each of them. You are asking to take the venvs out of there. Running one solution and 1+ tasks will be fine. If we want to run multiple solutions, then we have to include venv-switching logic inside.

Agree, multiple solutions is kinda overkill. I could always just run the script twice:

```
run --solution pandas
run --solution datatable
```

> I would say the human-readable form is not really a high priority; a new csv including headers should be enough. Simply because it is portable; otherwise we would need human-readable formats from R, py, julia, clickhouse, and eventually other languages in the future.

The run script calls all the other solutions and then presents the results to the user. If all those other scripts report their results in a unified csv format -- that's perfect. All run has to do is read that csv and print it to the console in a user-friendly way.
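
Something along these lines would already be enough (a rough sketch; the file name and the assumption that each solution writes one unified csv are mine, and the columns are whatever the solution scripts actually write):

```
# hypothetical sketch: pretty-print a unified timings csv to the console
import csv

def print_results(path):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        print("no results found in", path)
        return
    cols = list(rows[0].keys())
    # pad every column to the width of its longest value so the table lines up
    widths = {c: max(len(c), *(len(str(r[c])) for r in rows)) for c in cols}
    print("  ".join(c.ljust(widths[c]) for c in cols))
    for r in rows:
        print("  ".join(str(r[c]).ljust(widths[c]) for c in cols))

# print_results("time.csv")  # file name is only an example
```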

> Specifying datasize is not enough; "data name" would do, because the "data name" encodes datasize, K cardinality, NA percentage, and unsorted/sorted.

I understand that there are many parameters that go into creating the input dataset, and they could each have been turned into their own flag (such as --nrows N --cardinality K --nafraction F --sorted). However, in terms of replicating the benchmark there is only one parameter: datasize, which is reported as either 0.5GB, 5GB, or 50GB on the report page (https://h2oai.github.io/db-benchmark/). So even if the benchmark uses different datasets for different questions -- these are all implementation details that the user shouldn't have to know about.

If you want, you can extend the framework by allowing the user to specify arbitrary parameters for nrows / cardinality / etc., and run the benchmark on such a custom dataset -- however, this is clearly a lower priority than having a simple mechanism for replicating the results that are reported.
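
For what it's worth, expanding a single datasize into the concrete data names that appear in the log above (G1_<rows>_<cardinality>_<NA%>_<sorted>) could be a tiny helper inside the run script. A sketch, assuming the set of combinations mirrors the groupby runs in that log and may not be exhaustive:

```
# hypothetical helper: expand a datasize into the data names seen in the log output above
def data_names(datasize):
    # cardinalities 1e2/1e1/2e0 with na=0, sorted=0, plus the sorted variant of 1e2
    names = [f"G1_{datasize}_{k}_0_0" for k in ("1e2", "1e1", "2e0")]
    names.append(f"G1_{datasize}_1e2_0_1")
    return names

print(data_names("1e7"))
# ['G1_1e7_1e2_0_0', 'G1_1e7_1e1_0_0', 'G1_1e7_2e0_0_0', 'G1_1e7_1e2_0_1']
```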

> Specifying particular questions to run is not really feasible because they run in sequence. All scripts would have to guard each question individually.

I presume it's not too difficult, though? For example, say the run script sets up environment variables QUESTION1, QUESTION2, ..., QUESTION10 before invoking the solution, and inside the solution we simply check:

```
import os

if "QUESTION1" in os.environ:
    ...  # run question 1
if "QUESTION2" in os.environ:
    ...  # run question 2
```

I understand that this may involve modifying the scripts for each of the solutions; however, if this seems too hard, we may approach it in a non-committal fashion: the run script tells the solution script which questions to run, and the solution script may then run either those specific questions, or all questions, or whatever questions it can.
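
On the run-script side, passing the selected questions down could look roughly like this (a hypothetical sketch; the QUESTION<n> variables are just the convention proposed above, and the script path is only an example):

```
# hypothetical sketch of the run-script side: pass the selected questions to the
# solution script via QUESTION<n> environment variables
import os
import subprocess

def run_solution(script, questions):
    env = dict(os.environ)
    for q in questions:
        env[f"QUESTION{q}"] = "1"
    # the solution script checks `if "QUESTION<n>" in os.environ` before each question
    subprocess.run(["python", script], env=env, check=True)

# run_solution("pydatatable/groupby-pydatatable.py", questions=[1, 2])
```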

jangorecki commented 4 years ago

To clarify, the different factors (K cardinality, NA fraction, sorted) are all used for all questions. It is the report that presents only a single slice over those dimensions (k=1e2, na=0, sorted=0). At the bottom of the report page there are links to all the other slices.

I realized it might also be tricky to handle multiple tasks, or even multiple data sizes, at once, because of the current logging approach. Each single combination of all factors is a fresh instance executing the benchmark script, and each such instance is logged in the logs.csv file. This is because it may happen that a solution fails to complete even the very first question; then no timings would be written to time.csv, but we still want to know that there was an attempt to solve those questions. So for now it seems most reasonable to make the solution launcher you requested support only scalar values for all arguments (solution, task, size, etc.).
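
Put differently, anything that needs multiple combinations can simply loop over the launcher, with one fresh invocation per combination. A rough sketch, assuming a launcher that accepts the scalar --solution/--task/--datasize arguments discussed above (the ./run name is just a placeholder):

```
# hypothetical driver loop: one fresh launcher invocation per combination of factors
import subprocess

solutions = ["pydatatable", "pandas"]
tasks = ["groupby", "join"]
datasizes = ["1e7", "1e8"]

for solution in solutions:
    for task in tasks:
        for datasize in datasizes:
            # each combination runs in its own process, so a crash in one
            # combination does not prevent the remaining ones from running
            subprocess.run(
                ["./run", f"--solution={solution}", f"--task={task}",
                 f"--datasize={datasize}"],
                check=False,
            )
```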

jangorecki commented 4 years ago

The requested feature has already been merged to master; all of the below work:

```
./_launcher/solution.R --solution=data.table
./_launcher/solution.R --solution=dplyr
./_launcher/solution.R --solution=pandas
./_launcher/solution.R --solution=dask
./_launcher/solution.R --solution=pydatatable
./_launcher/solution.R --solution=juliadf
./_launcher/solution.R --solution=spark
./_launcher/solution.R --solution=cudf
./_launcher/solution.R --solution=clickhouse
```