h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Error running benchmark for datatable #89

Closed st-pasha closed 5 years ago

st-pasha commented 5 years ago

In run.conf I specify to run the benchmark for datatable only:

# task, used in init-setup-iteration.R
export RUN_TASKS="groupby" # join sort read"

# solution, used in init-setup-iteration.R
export RUN_SOLUTIONS="pydatatable"

# not run benchmarks but print what would run and what skipped
export MOCKUP=false

# print csv entries to console, used when writing timings to csv
export CSV_VERBOSE=false

# flag to upgrade tools, used in run.sh on init
export DO_UPGRADE=false

# force run, ignore if same version was run already
export FORCE_RUN=true

# flag to build reports, used in run.sh before publish
export DO_REPORT=false

# flag to publish, used in run.sh before exit
export DO_PUBLISH=false

Still, running run.sh returns an error about the missing clickhouse client:

$ ./run.sh
Unexpected return code from clickhouse-client: 127
Error: '\.' is an unrecognized escape in character string starting ""[^0-9\."
Execution halted
# Benchmark run 1561766743 started
./versions.sh: line 5: clickhouse-client: command not found
Error in read.dcf(system.file(package = "dplyr", "DESCRIPTION"), fields = c("Version",  : 
  cannot open the connection
In addition: Warning message:
In read.dcf(system.file(package = "dplyr", "DESCRIPTION"), fields = c("Version",  :
  cannot open compressed file '', probable reason 'No such file or directory'
Execution halted
# Benchmark run 1561766743 failed to check versions of currently installed solutions

What is the proper way to run a single solution?

jangorecki commented 5 years ago

This is the proper way to run a single solution. Unfortunately, clickhouse is not yet escaped nicely.

jangorecki commented 5 years ago

@st-pasha please retry on latest master

jangorecki commented 5 years ago

@st-pasha any update on this?

st-pasha commented 5 years ago

Apologies, I missed your previous comment somehow.

With latest master I no longer see any clickhouse-related problems:

Error: '\.' is an unrecognized escape in character string starting ""[^0-9\."
Execution halted
# Benchmark run 1564421368 started
starting: pydatatable groupby G1_1e7_1e2_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e2_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e2_0_0
starting: pydatatable groupby G1_1e7_1e1_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e1_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e1_0_0
starting: pydatatable groupby G1_1e7_2e0_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_2e0_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_2e0_0_0
starting: pydatatable groupby G1_1e7_1e2_0_1
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e2_0_1.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e2_0_1
# Benchmark run 1564421368 has been completed in 1s

For the first error, my guess is that bash "eats" one level of escaping, so R sees only \. which is not a proper escape. An easy way to fix this is to remove backslashes altogether, since in regex language a dot inside square brackets is always interpreted literally. So, after doing that and running the command in R I get:

Error in `[.data.table`(data.table::fread("free -h | grep Swap", header = FALSE),  : 
  Item 1 of j is 1 which is outside the column number range [1,ncol=0]
In addition: Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//RtmpVh7i9Q/file85061dade4dc' has size 0. Returning a NULL data.table.

Running just the first fread command returns:

> data.table::fread("free -h | grep Swap", header=FALSE)
sh: free: command not found
Null data.table (0 rows and 0 cols)
Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//RtmpVh7i9Q/file850638c36bd' has size 0. Returning a NULL data.table.

So the actual issue is that my shell doesn't have the free command line utility, yet somehow data.table gobbles that error and issues a warning instead.
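
To illustrate the earlier point about the character class, here is a minimal sketch in Python (not the R code the benchmark actually uses; the function name and the sample input are made up for illustration). It shows that a dot inside square brackets matches only a literal dot, with or without a backslash:

```python
import re

# The same character class written two ways: without and with an escaped dot.
UNESCAPED = re.compile(r"[^0-9.]")
ESCAPED = re.compile(r"[^0-9\.]")


def keep_digits_and_dots(text, pattern=UNESCAPED):
    """Drop every character that is not a digit or a literal dot."""
    return pattern.sub("", text)


sample = "procps-ng 3.3.10"  # hypothetical input, for illustration only
print(keep_digits_and_dots(sample))
# Both spellings of the class give the same result.
print(keep_digits_and_dots(sample, ESCAPED) == keep_digits_and_dots(sample))
```

If the dot were a wildcard inside the brackets, the negated class would match nothing and letters would survive; they do not.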


Still, despite the errors above the benchmark runs, producing some more error messages:

starting: pydatatable groupby G1_1e7_1e2_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e2_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e2_0_0

I don't know what was supposed to be printed here, but I was hoping for something similar to the benchmark chart:

Question 1 -- first run time -- second run time
Question 2 -- first run time -- second run time
...

jangorecki commented 5 years ago

Are you trying to run the benchmark on macOS? It was designed with a Debian-compatible OS in mind. The machine that runs the benchmark uses:

GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)
free from procps-ng 3.3.10

The last issue is, I believe, caused by the missing out directory; I will amend the code to create it automatically if it doesn't exist.
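
For illustration only, a minimal Python sketch of the "create it if it doesn't exist" idea. This is not the benchmark's launcher code; `open_run_log` is a hypothetical helper, and the file name pattern is simply copied from the error messages above:

```python
from pathlib import Path


def open_run_log(solution, task, data, out_dir="out"):
    """Open the .out log for one run, creating the directory first.

    Hypothetical helper: the real launcher redirects output from the shell.
    """
    out_path = Path(out_dir)
    # exist_ok=True makes this a no-op when the directory is already there.
    out_path.mkdir(parents=True, exist_ok=True)
    return open(out_path / f"run_{solution}_{task}_{data}.out", "a")


if __name__ == "__main__":
    with open_run_log("pydatatable", "groupby", "G1_1e7_1e2_0_0") as log:
        log.write("starting\n")
```

Without the `mkdir` step, opening the file fails in the same way the shell redirect did: the parent directory is missing.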

Timings land in the time.csv file, and attempts to run scripts land in logs.csv. The structure of the timings is as follows:

question 1 -- first run time
question 1 -- second run time
question 2 -- first run time
question 2 -- second run time

which is later reshaped for the reports into the structure you mentioned, see https://github.com/h2oai/db-benchmark/blob/936c3a6aaaf3045b62e4c5b0e3a705a1a867f4e2/report.R#L68
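
As a rough sketch of that reshaping (in Python, not the actual code in report.R), one row per run can be turned into one line per question. The column names `question`, `run` and `time_sec` are assumptions, and the timings are placeholder numbers rather than real results:

```python
from collections import defaultdict


def pivot_runs(rows):
    """Turn one-row-per-run records into {question: {run: seconds}}."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["question"]][int(row["run"])] = float(row["time_sec"])
    return dict(wide)


# Placeholder rows in the long layout; column names are assumed.
long_rows = [
    {"question": "question 1", "run": "1", "time_sec": "0.50"},
    {"question": "question 1", "run": "2", "time_sec": "0.25"},
    {"question": "question 2", "run": "1", "time_sec": "1.50"},
    {"question": "question 2", "run": "2", "time_sec": "1.25"},
]

for question, runs in pivot_runs(long_rows).items():
    print(question, "--", runs.get(1), "--", runs.get(2))
```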

Please retry on latest master, ideally after installing free.
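
For systems where free is not available, the swap check could be skipped instead of failing. A minimal sketch in Python, assuming the check only needs the "Swap" line of `free -h` (the real check is done in R with data.table::fread; `read_swap_line` is a hypothetical name):

```python
import shutil
import subprocess


def read_swap_line(command="free"):
    """Return the 'Swap' line of `<command> -h`, or None if unavailable."""
    # shutil.which returns None when the executable is not on PATH,
    # which is the situation reported as "free: command not found".
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, "-h"], capture_output=True, text=True, check=False
    )
    for line in result.stdout.splitlines():
        if line.startswith("Swap"):
            return line
    return None


if __name__ == "__main__":
    line = read_swap_line()
    print(line if line is not None else "swap check skipped: free not found")
```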

st-pasha commented 5 years ago

According to SO, the equivalent of free on MacOS is vm_stat, which reports things like this:

$ vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                              208197.
Pages active:                           1478906.
Pages inactive:                          868832.
Pages speculative:                       107124.
Pages throttled:                              0.
Pages wired down:                        997248.
Pages purgeable:                           9437.
"Translation faults":               36699619531.
Pages copy-on-write:                  444929577.
Pages zero filled:                   5459321091.
Pages reactivated:                    487618793.
Pages purged:                          19600537.
File-backed pages:                       468271.
Anonymous pages:                        1986591.
Pages stored in compressor:             4044251.
Pages occupied by compressor:            533481.
Decompressions:                       196753666.
Compressions:                        1049140452.
Pageins:                               87517994.
Pageouts:                                129923.
Swapins:                              148769198.
Swapouts:                             370675952.

Now, disabling swap can be done (https://summercode.com/wiki/how-to-disable-or-enable-swapping-in-mac-os-x), but it seems mighty dangerous... However, since the check is optional (the script keeps running even if the check fails), I guess it's not that important.
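
If the check were ever ported to macOS, the vm_stat output above could be parsed along these lines. This is only a sketch based on the sample output shown here; `parse_vm_stat` is a hypothetical helper, and which counters would count as "swap in use" is left open:

```python
def parse_vm_stat(text):
    """Parse `vm_stat` output into a {counter name: integer} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        # Values are printed with a trailing dot, e.g. "208197."
        value = value.strip().rstrip(".")
        # The header line has no purely numeric value and is skipped.
        if sep and value.isdigit():
            stats[name.strip().strip('"')] = int(value)
    return stats


sample = """Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                              208197.
"Translation faults":               36699619531.
Swapins:                              148769198.
Swapouts:                             370675952.
"""

stats = parse_vm_stat(sample)
print(stats["Swapins"], stats["Swapouts"])
```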

This is the output that I'm currently getting:

sh: free: command not found
Error in `[.data.table`(data.table::fread("free -h | grep Swap", header = FALSE),  : 
  Item 1 of j is 1 which is outside the column number range [1,ncol=0]
Calls: [ -> [.data.table
In addition: Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//Rtmp7qDc35/filef3461723cb91' has size 0. Returning a NULL data.table.
Execution halted
# Benchmark run 1564431516 started
starting: pydatatable groupby G1_1e7_1e2_0_0
finished: pydatatable groupby G1_1e7_1e2_0_0: stderr 5
starting: pydatatable groupby G1_1e7_1e1_0_0
finished: pydatatable groupby G1_1e7_1e1_0_0: stderr 5
starting: pydatatable groupby G1_1e7_2e0_0_0
finished: pydatatable groupby G1_1e7_2e0_0_0: stderr 5
starting: pydatatable groupby G1_1e7_1e2_0_1
finished: pydatatable groupby G1_1e7_1e2_0_1: stderr 5
# Benchmark run 1564431516 has been completed in 2s

At first it was complaining with "# Benchmark run 1564431330 aborted. './data' directory does not exists", but that error disappeared after I created the directory "data". I even copied the files "G11e7*" there, just in case. Still, some errors are reported in the printout above, and I can't figure out what they mean.

jangorecki commented 5 years ago

Please include some of the out/*.err files. Note that data files are now named like G1_1e7_1e2_0_0.csv; the old names did not have the two extra zeros, which stand for the NA percentage and whether the data are ordered.
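
A small sketch of how such a name could be split into its parts, in Python. Only the meaning of the last two fields comes from the comment above; reading the second and third fields as a row count and a group count is an assumption, and `DataName` / `parse_data_name` are hypothetical names:

```python
from typing import NamedTuple


class DataName(NamedTuple):
    prefix: str
    rows: int        # assumed meaning of the second field
    groups: int      # assumed meaning of the third field
    na_percent: int  # fourth field: NA percentage
    is_sorted: bool  # fifth field: whether the data are ordered


def parse_data_name(name):
    """Parse names like 'G1_1e7_1e2_0_0.csv'; reject the old short form."""
    stem = name[:-4] if name.endswith(".csv") else name
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields: {name!r}")
    prefix, rows, groups, na_percent, ordered = parts
    return DataName(
        prefix,
        int(float(rows)),    # "1e7" is scientific notation
        int(float(groups)),
        int(na_percent),
        ordered == "1",
    )


print(parse_data_name("G1_1e7_1e2_0_0.csv"))
```

A name in the old style, without the two trailing fields, raises ValueError, which makes a stale file in the data directory easy to spot.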

st-pasha commented 5 years ago

Ah, I see. The .err files complain about the missing modules "psutil" and "pandas". After installing those, the script finally runs.
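
For anyone hitting the same thing, a quick pre-flight check along these lines would surface the problem before the run. This is a sketch and not part of the benchmark; `missing_modules` is a hypothetical helper, and the module list only contains the two names from the .err files:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["psutil", "pandas"])
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("all required modules found")
```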

jangorecki commented 5 years ago

If there are no other problems here, and you obtained timings from the time.csv file, then we can close this issue.

st-pasha commented 5 years ago

sure