h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
323 stars 85 forks source link

groupby datagen - measure variables could be more unique #93

Closed jangorecki closed 3 years ago

jangorecki commented 5 years ago

The original grouping benchmark was generating a floating point number column in a fast way, but sacrificing its uniqueness.

v3 =  sample(round(runif(100,max=100),4), N, TRUE)

We could improve that. This will require to re-generate all data. Timings before and after has to be included. Change to datagen script is already made but data were not regenerated.

jangorecki commented 5 years ago

Additionally range of integer measures v1 and v2 could be extended. As of now there are only 5 distinct values. That does not well represents am integer measure variable. I suggest to use range of v1 [1,N/1e6] and v2 [1,N/1e3]. So for

In the latter case, 1e7 data, v1 is then not very far from the current state [1,5].

jangorecki commented 3 years ago

Resolved by 9ce7cc802c9567477ab8a07d82a62bcee59572e1 ffbb3e43706696f657e1e9f3bb7e140e4430d2aa 4fb8646feae266b4d7c83031442a03fdcfe531f3

Average of new/old timings

      solution   new/old
1:  data.table 0.9530879
2: pydatatable 0.9811307
3:       dplyr 1.0115221
4:      pandas 1.0223014
5:       spark 1.0154845
6:        dask 1.0538153
7:     juliadf 0.9677368
8:        cudf 0.9660064
9:  clickhouse 1.4063286

Here clickhouse seems to suffer.

              data                    question   solution    task     old     new
 1: G1_1e7_1e2_0_0       sum v1 mean v3 by id3 clickhouse groupby   0.146   0.185
 2: G1_1e7_1e2_0_0            sum v1:v3 by id6 clickhouse groupby   0.127   0.173
 3: G1_1e7_1e2_0_0  median v3 sd v3 by id4 id5 clickhouse groupby   0.158   0.183
 4: G1_1e7_1e2_0_0      max v1 - min v2 by id3 clickhouse groupby   0.152   0.191
 5: G1_1e7_1e2_0_0       largest two v3 by id6 clickhouse groupby   0.745   0.849
 6: G1_1e7_1e2_0_0 regression v1 v2 by id2 id4 clickhouse groupby   0.189   0.223
 7: G1_1e7_1e1_0_0               sum v1 by id1 clickhouse groupby   0.067   0.081
 8: G1_1e7_1e1_0_0           sum v1 by id1:id2 clickhouse groupby   0.100   0.133
 9: G1_1e7_2e0_0_0               sum v1 by id1 clickhouse groupby   0.074   0.093
10: G1_1e7_2e0_0_0            sum v1:v3 by id6 clickhouse groupby   0.386   0.432
11: G1_1e7_1e2_0_1       sum v1 mean v3 by id3 clickhouse groupby   0.143   0.160
12: G1_1e7_1e2_0_1       largest two v3 by id6 clickhouse groupby   0.769   0.849
13: G1_1e8_1e2_0_0               sum v1 by id1 clickhouse groupby   0.306   1.341
14: G1_1e8_1e2_0_0           sum v1 by id1:id2 clickhouse groupby   0.950   1.835
15: G1_1e8_1e2_0_0       sum v1 mean v3 by id3 clickhouse groupby   1.140   2.246
16: G1_1e8_1e2_0_0           mean v1:v3 by id4 clickhouse groupby   0.282   2.066
17: G1_1e8_1e2_0_0            sum v1:v3 by id6 clickhouse groupby   1.113   3.633
18: G1_1e8_1e2_0_0  median v3 sd v3 by id4 id5 clickhouse groupby   2.265   2.494
19: G1_1e8_1e1_0_0               sum v1 by id1 clickhouse groupby   0.274   1.171
20: G1_1e8_1e1_0_0           sum v1 by id1:id2 clickhouse groupby   0.753   1.844
21: G1_1e8_1e1_0_0       sum v1 mean v3 by id3 clickhouse groupby   2.583   3.585
22: G1_1e8_1e1_0_0           mean v1:v3 by id4 clickhouse groupby   1.258   1.828
23: G1_1e8_1e1_0_0  median v3 sd v3 by id4 id5 clickhouse groupby   0.621   2.179
24: G1_1e8_2e0_0_0       sum v1 mean v3 by id3 clickhouse groupby   4.257   5.792
25: G1_1e8_2e0_0_0           mean v1:v3 by id4 clickhouse groupby   0.417   1.333
26: G1_1e8_2e0_0_0            sum v1:v3 by id6 clickhouse groupby   3.079   4.205
27: G1_1e8_1e2_0_1           mean v1:v3 by id4 clickhouse groupby   0.273   1.339
28: G1_1e8_1e2_0_1            sum v1:v3 by id6 clickhouse groupby   1.124   2.356
29: G1_1e8_1e2_0_1  median v3 sd v3 by id4 id5 clickhouse groupby   1.146   2.564
30: G1_1e8_1e2_0_1      max v1 - min v2 by id3 clickhouse groupby   1.264   2.212
31: G1_1e8_1e2_0_1       largest two v3 by id6 clickhouse groupby   7.962   9.727
32: G1_1e9_1e2_0_0       sum v1 mean v3 by id3 clickhouse groupby  51.843  84.889
33: G1_1e9_1e2_0_0           mean v1:v3 by id4 clickhouse groupby  42.455  61.378
34: G1_1e9_1e2_0_0  median v3 sd v3 by id4 id5 clickhouse groupby  33.230  37.604
35: G1_1e9_1e2_0_0 regression v1 v2 by id2 id4 clickhouse groupby  35.743  55.963
36: G1_1e9_1e1_0_0               sum v1 by id1 clickhouse groupby   4.188  23.292
37: G1_1e9_1e1_0_0           sum v1 by id1:id2 clickhouse groupby   7.709  13.295
38: G1_1e9_1e1_0_0       sum v1 mean v3 by id3 clickhouse groupby 179.184 236.698
39: G1_1e9_1e1_0_0           mean v1:v3 by id4 clickhouse groupby  11.492  21.022
40: G1_1e9_2e0_0_0       sum v1 mean v3 by id3 clickhouse groupby 316.015 363.193
41: G1_1e9_2e0_0_0           mean v1:v3 by id4 clickhouse groupby   6.845  26.624
42: G1_1e9_1e2_0_1               sum v1 by id1 clickhouse groupby   4.285   9.157
43: G1_1e9_1e2_0_1           sum v1 by id1:id2 clickhouse groupby   5.821   6.454
44: G1_1e9_1e2_0_1       sum v1 mean v3 by id3 clickhouse groupby  16.959  41.066
45: G1_1e9_1e2_0_1  median v3 sd v3 by id4 id5 clickhouse groupby  11.747  30.925
46: G1_1e9_1e2_0_1      max v1 - min v2 by id3 clickhouse groupby  15.357  64.714
47: G1_1e9_1e2_0_1       largest two v3 by id6 clickhouse groupby  85.972 143.539
48: G1_1e9_1e2_0_1 regression v1 v2 by id2 id4 clickhouse groupby  19.392  62.445
              data                    question   solution    task     old     new