h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

join data could be better distributed #154

Closed jangorecki closed 3 years ago

jangorecki commented 4 years ago

Currently the non-matching rows of the RHS data (around 10%) fail to match because their values are all higher than the maximum matching value. https://github.com/h2oai/db-benchmark/blob/bef4ff937c802e5007a201d131b221ea54895132/_data/join-datagen.R#L133-L136 To improve join stress coverage, those non-matching values should be mixed in among the matching ones; otherwise algorithms like sort-merge join can easily cut off the non-matching part.
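A minimal sketch of the problem (in Python/numpy rather than the repo's R datagen; the key ranges and counts below are illustrative, not the benchmark's actual parameters): in the old scheme the non-matching RHS keys all sit above the matching range, so a sorted RHS lets a merge join stop early, whereas drawing the misses from the same range spreads them across the whole key space.

```python
import numpy as np

rng = np.random.default_rng(42)

n_match, n_nomatch = 90, 10  # ~10% non-matching, as in the issue; sizes are toy values

# Keys that exist on both sides of the join.
match_keys = rng.integers(1, 1000, size=n_match)

# Old scheme (the problem): every non-matching key is strictly above the
# matching range, so after sorting, the non-matching tail is trivially skippable.
old_nomatch = rng.integers(match_keys.max() + 1,
                           match_keys.max() + 1000, size=n_nomatch)
assert old_nomatch.min() > match_keys.max()

# Fixed scheme (the idea of the fix): draw non-matching keys from the same
# range but exclude the matching values, so misses are interleaved with hits.
pool = np.setdiff1d(np.arange(1, 1000), match_keys)
new_nomatch = rng.choice(pool, size=n_nomatch, replace=False)
assert np.intersect1d(new_nomatch, match_keys).size == 0  # still non-matching
assert (new_nomatch < match_keys.max()).any()             # but no longer a separable tail
```

With the interleaved layout, a sort-merge join has to carry the non-matching rows through the whole merge pass instead of discarding them as a contiguous suffix.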

jangorecki commented 3 years ago

Resolved by 3ff6d43029011841f6a0613231e60d89103ead96 and 235f339e381b7004f39dff55c1a96ccca37e84fd.

These are the average timing ratios (new data / old data) for each solution:

      solution   new/old
1:  data.table 0.9438163
2: pydatatable 1.2447575
3:       dplyr 0.9437487
4:      pandas 0.9835782
5:       spark 0.9408303
6:        dask 1.0567662
7:     juliadf 0.9882151
8:        cudf 0.9865482
9:  clickhouse 0.9534466

In general they look good. pydatatable was on average about 24% slower on the new join data.