h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
323 stars 85 forks source link

ClickHouse table engine #91

Closed jangorecki closed 4 years ago

jangorecki commented 5 years ago

ClickHouse has been already added to benchmark script and report. There are some pending items to close #73 fully. New question appeared, precisely, what table engine should be used, there are generally two options to consider:


Although it is happening that merge tree table engine yields faster query execution time than memory engine (see timings below). Moreover all 1e9 rows (50GB) datasets, as of current CH version, are failing when writing data to in-memory tables.

Code: 210. DB::NetException: Connection reset by peer, while writing to socket (
127.0.0.1:9000)

We could say "as documented" because 1e9 rows is 10 times more than docs suggests for in-memory tables.


As of now we do run benchmarks on both table engines. For a memory table engine we use G1 data file prefix. For merge tree table engine we use G2 prefix (extra 1:N column has been added, as required by merge tree). Timings of both types of table engines lands in time.csv. What lands on the benchplot on report page is the in-memory timing, unless there is no, then merge tree timing. Logic for that can be found in report.R#L141-L164.


The question is, should we switch to merge tree timings only?

cc @alexey-milovidov

jangorecki commented 5 years ago
source("report.R")
download.file("https://h2oai.github.io/db-benchmark/time.csv", "time.csv")
download.file("https://h2oai.github.io/db-benchmark/logs.csv", "logs.csv")
ld = time_logs()
## note that due to change of data (new benchmark timings) below query might not be functional, filtering time.csv and logs.csv for a timestamp of the post will allow to reproduce below numbers
ld[script_recent==TRUE & solution=="clickhouse" & question_group=="basic" & !(substr(data, 1, 2)=="G1" & engine=="mergetree"), .(t1=time_sec_1, t2=time_sec_2),, .(engine, in_rows, k, sorted, iquestion, question)
   ][, dcast(.SD, in_rows+k+sorted+iquestion+question~engine, value.var=c("t1","t2"))
     ][, .(in_rows, k, sorted, iquestion, question, t1_memory, t1_mergetree, t1_mem2mt=t1_memory/t1_mergetree, t2_memory, t2_mergetree, t2_mem2mt=t2_memory/t2_mergetree)
       ] -> dd
print(dd, row.names=FALSE)
dd[, .(t1_mean_ratio=mean(t1_mem2mt), t2_mean_ratio=mean(t2_mem2mt)), .(in_rows, k, sorted)
   ][, print(.SD, row.names=FALSE)] -> nul

in_rows, k, sorted are dataset characteristics question - query t1 - first query run timing in seconds t2 - second mem2mt - memory_time / mergetree_time (larger than 1.0 means memory is slower than mergretree)

 in_rows   k   sorted iquestion              question t1_memory t1_mergetree t1_mem2mt t2_memory t2_mergetree t2_mem2mt
     1e7 1e2 unsorted         1         sum v1 by id1     0.064        0.132 0.4848485     0.064        0.132 0.4848485
     1e7 1e2 unsorted         2     sum v1 by id1:id2     0.144        0.160 0.9000000     0.144        0.160 0.9000000
     1e7 1e2 unsorted         3 sum v1 mean v3 by id3     0.202        0.192 1.0520833     0.190        0.170 1.1176471
     1e7 1e2 unsorted         4     mean v1:v3 by id4     0.030        0.049 0.6122449     0.030        0.049 0.6122449
     1e7 1e2 unsorted         5      sum v1:v3 by id6     0.207        0.127 1.6299213     0.185        0.127 1.4566929
     1e7 1e2   sorted         1         sum v1 by id1     0.031        0.034 0.9117647     0.031        0.034 0.9117647
     1e7 1e2   sorted         2     sum v1 by id1:id2     0.112        0.076 1.4736842     0.112        0.076 1.4736842
     1e7 1e2   sorted         3 sum v1 mean v3 by id3     0.237        0.164 1.4451220     0.197        0.163 1.2085890
     1e7 1e2   sorted         4     mean v1:v3 by id4     0.032        0.051 0.6274510     0.032        0.048 0.6666667
     1e7 1e2   sorted         5      sum v1:v3 by id6     0.204        0.122 1.6721311     0.228        0.119 1.9159664
     1e7 1e1 unsorted         1         sum v1 by id1     0.048        0.051 0.9411765     0.048        0.051 0.9411765
     1e7 1e1 unsorted         2     sum v1 by id1:id2     0.285        0.147 1.9387755     0.285        0.172 1.6569767
     1e7 1e1 unsorted         3 sum v1 mean v3 by id3     0.932        0.340 2.7411765     1.075        0.340 3.1617647
     1e7 1e1 unsorted         4     mean v1:v3 by id4     0.024        0.050 0.4800000     0.024        0.050 0.4800000
     1e7 1e1 unsorted         5      sum v1:v3 by id6     0.702        0.242 2.9008264     0.640        0.242 2.6446281
     1e7 2e0 unsorted         1         sum v1 by id1     0.051        0.051 1.0000000     0.051        0.051 1.0000000
     1e7 2e0 unsorted         2     sum v1 by id1:id2     0.306        0.154 1.9870130     0.306        0.183 1.6721311
     1e7 2e0 unsorted         3 sum v1 mean v3 by id3     1.358        0.686 1.9795918     1.383        0.666 2.0765766
     1e7 2e0 unsorted         4     mean v1:v3 by id4     0.034        0.051 0.6666667     0.034        0.051 0.6666667
     1e7 2e0 unsorted         5      sum v1:v3 by id6     1.014        0.501 2.0239521     0.979        0.549 1.7832423
     1e8 1e2 unsorted         1         sum v1 by id1     0.474        0.404 1.1732673     0.474        0.404 1.1732673
     1e8 1e2 unsorted         2     sum v1 by id1:id2     0.792        0.840 0.9428571     0.792        0.771 1.0272374
     1e8 1e2 unsorted         3 sum v1 mean v3 by id3     2.716        1.547 1.7556561     2.189        1.449 1.5106970
     1e8 1e2 unsorted         4     mean v1:v3 by id4     0.116        0.284 0.4084507     0.116        0.284 0.4084507
     1e8 1e2 unsorted         5      sum v1:v3 by id6     1.981        1.465 1.3522184     1.910        1.479 1.2914131
     1e8 1e2   sorted         1         sum v1 by id1     0.265        0.228 1.1622807     0.265        0.228 1.1622807
     1e8 1e2   sorted         2     sum v1 by id1:id2     0.681        0.586 1.1621160     0.482        0.564 0.8546099
     1e8 1e2   sorted         3 sum v1 mean v3 by id3     2.650        1.694 1.5643447     2.205        1.659 1.3291139
     1e8 1e2   sorted         4     mean v1:v3 by id4     0.199        0.361 0.5512465     0.098        0.349 0.2808023
     1e8 1e2   sorted         5      sum v1:v3 by id6     1.935        1.589 1.2177470     1.797        1.646 1.0917375
     1e8 1e1 unsorted         1         sum v1 by id1     0.431        0.375 1.1493333     0.431        0.375 1.1493333
     1e8 1e1 unsorted         2     sum v1 by id1:id2     1.295        1.408 0.9197443     1.237        1.206 1.0257048
     1e8 1e1 unsorted         3 sum v1 mean v3 by id3     4.649        3.764 1.2351222     4.377        4.004 1.0931568
     1e8 1e1 unsorted         4     mean v1:v3 by id4     0.156        0.326 0.4785276     0.100        0.326 0.3067485
     1e8 1e1 unsorted         5      sum v1:v3 by id6     3.188        2.706 1.1781227     3.081        2.810 1.0964413
     1e8 2e0 unsorted         1         sum v1 by id1     0.372        0.311 1.1961415     0.154        0.292 0.5273973
     1e8 2e0 unsorted         2     sum v1 by id1:id2     1.052        1.249 0.8422738     1.557        1.315 1.1840304
     1e8 2e0 unsorted         3 sum v1 mean v3 by id3     8.826        6.453 1.3677359     7.618        7.605 1.0017094
     1e8 2e0 unsorted         4     mean v1:v3 by id4     0.226        0.405 0.5580247     0.132        0.405 0.3259259
     1e8 2e0 unsorted         5      sum v1:v3 by id6     5.194        4.918 1.0561204     5.098        4.996 1.0204163
     1e9 1e2 unsorted         1         sum v1 by id1        NA       66.769        NA        NA        2.468        NA
     1e9 1e2 unsorted         2     sum v1 by id1:id2        NA       29.814        NA        NA        6.945        NA
     1e9 1e2 unsorted         3 sum v1 mean v3 by id3        NA      101.174        NA        NA       24.452        NA
     1e9 1e2 unsorted         4     mean v1:v3 by id4        NA       75.207        NA        NA        2.591        NA
     1e9 1e2 unsorted         5      sum v1:v3 by id6        NA       28.183        NA        NA       21.082        NA
     1e9 1e2   sorted         1         sum v1 by id1        NA        3.043        NA        NA        1.588        NA
     1e9 1e2   sorted         2     sum v1 by id1:id2        NA        4.593        NA        NA        4.107        NA
     1e9 1e2   sorted         3 sum v1 mean v3 by id3        NA       32.187        NA        NA       24.275        NA
     1e9 1e2   sorted         4     mean v1:v3 by id4        NA        4.768        NA        NA        2.618        NA
     1e9 1e2   sorted         5      sum v1:v3 by id6        NA       23.256        NA        NA       22.731        NA
     1e9 1e1 unsorted         1         sum v1 by id1        NA       84.982        NA        NA        2.433        NA
     1e9 1e1 unsorted         2     sum v1 by id1:id2        NA       24.744        NA        NA       15.275        NA
     1e9 1e1 unsorted         3 sum v1 mean v3 by id3        NA      171.726        NA        NA       93.038        NA
     1e9 1e1 unsorted         4     mean v1:v3 by id4        NA       59.188        NA        NA        2.933        NA
     1e9 1e1 unsorted         5      sum v1:v3 by id6        NA       62.258        NA        NA       37.631        NA
     1e9 2e0 unsorted         1         sum v1 by id1        NA           NA        NA        NA           NA        NA
     1e9 2e0 unsorted         2     sum v1 by id1:id2        NA           NA        NA        NA           NA        NA
     1e9 2e0 unsorted         3 sum v1 mean v3 by id3        NA           NA        NA        NA           NA        NA
     1e9 2e0 unsorted         4     mean v1:v3 by id4        NA           NA        NA        NA           NA        NA
     1e9 2e0 unsorted         5      sum v1:v3 by id6        NA           NA        NA        NA           NA        NA
 in_rows   k   sorted iquestion              question t1_memory t1_mergetree t1_mem2mt t2_memory t2_mergetree t2_mem2mt
 in_rows   k   sorted t1_mean_ratio t2_mean_ratio
     1e7 1e2 unsorted     0.9358196     0.9142867
     1e7 1e2   sorted     1.2260306     1.2353342
     1e7 1e1 unsorted     1.8003910     1.7769092
     1e7 2e0 unsorted     1.5314447     1.4397233
     1e8 1e2 unsorted     1.1264899     1.0822131
     1e8 1e2   sorted     1.1315470     0.9437089
     1e8 1e1 unsorted     0.9921700     0.9342769
     1e8 2e0 unsorted     1.0040593     0.8118959
     1e9 1e2 unsorted            NA            NA
     1e9 1e2   sorted            NA            NA
     1e9 1e1 unsorted            NA            NA
     1e9 2e0 unsorted            NA            NA

conclusions:

jangorecki commented 5 years ago

Another clickhouse benchmark run happened in the meantime. memory engine to mergetree engine ratio from the recent run

 in_rows   k   sorted t1_mean_ratio t2_mean_ratio
     1e7 1e2 unsorted     1.3361793     1.4951925
     1e7 1e2   sorted     1.4447885     1.3880568
     1e7 1e1 unsorted     2.0377858     1.6922132
     1e7 2e0 unsorted     1.7449527     1.4594917
     1e8 1e2 unsorted     2.0234962     0.8623157
     1e8 1e2   sorted     1.1649341     1.0058074
     1e8 1e1 unsorted     1.0036960     0.8620307
     1e8 2e0 unsorted     0.9284681     0.8583559
     1e9 1e2 unsorted            NA            NA
     1e9 1e2   sorted            NA            NA
     1e9 1e1 unsorted            NA            NA
     1e9 2e0 unsorted            NA            NA

Leads to conclusion that mergetree is generally faster than memory. Current approach to present memory timings by default, and mergetree when memory timings are not available, doesn't seem to be fair.

jangorecki commented 4 years ago

I think we had this issue open long enough. If needed it will be re-opened.

jangorecki commented 3 years ago

It is no longer required to specify primary key for mergetree tables, thus we don't need to maintain a workaround duplicating all csv files just to put extra PK column there, also described in #137. Therefore CH is now reading the same CSV files as other solutions. As a result G1 prefix is now being used for mergetree engine. This only affects new timings, all historical timings in time.csv will have proper value of engine field.