Resolved by commits 3ff6d43029011841f6a0613231e60d89103ead96 and 235f339e381b7004f39dff55c1a96ccca37e84fd.
These are the average new/old timing ratios for each solution (a ratio above 1 means the solution got slower on the new data):

```
      solution   new/old
1:  data.table 0.9438163
2: pydatatable 1.2447575
3:       dplyr 0.9437487
4:      pandas 0.9835782
5:       spark 0.9408303
6:        dask 1.0567662
7:     juliadf 0.9882151
8:        cudf 0.9865482
9:  clickhouse 0.9534466
```
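For reference, a minimal sketch of how such per-solution averages can be computed; it assumes the timings are already loaded into a data.table `d` with columns `solution`, `old`, and `new`, which is not shown in this issue:

```r
library(data.table)
# average new/old timing ratio per solution; NAs (missing runs) excluded
d[, .(`new/old` = mean(new / old, na.rm = TRUE)), by = solution]
```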
In general they look good. pydatatable was on average 24% slower on the new join data; the regressions above a 1.1 ratio are:
```r
d[task=="join" & solution=="pydatatable", .SD[mean(new/old, na.rm=TRUE)>1.1], .(data, question)]
            data               question    solution task    old    new
1: J1_1e7_NA_0_0     small inner on int pydatatable join 20.588 34.509
2: J1_1e7_NA_0_0    medium inner on int pydatatable join 21.755 59.177
3: J1_1e7_NA_0_0 medium inner on factor pydatatable join 20.525 31.012
4: J1_1e8_NA_0_0     small inner on int pydatatable join 41.818 46.969
5: J1_1e8_NA_0_0    medium outer on int pydatatable join 21.559 32.610
6: J1_1e8_NA_0_0       big inner on int pydatatable join 52.241 99.700
```
Currently the non-matching rows of the RHS data (around 10%) fail to match because their values are higher than the maximum matching value. https://github.com/h2oai/db-benchmark/blob/bef4ff937c802e5007a201d131b221ea54895132/_data/join-datagen.R#L133-L136 To improve join stress coverage, those non-matching values should be mixed in among the matching ones; otherwise algorithms like sort-merge join can easily cut off the non-matching part.
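A minimal sketch of the interleaving idea in R. The names (`n_keys`, `match_frac`) and the sampling scheme are hypothetical, not taken from join-datagen.R; it only illustrates drawing non-matching key values from inside the matching range so they no longer all sort past its end:

```r
set.seed(108)
n_keys     <- 1e6    # hypothetical size of the key universe
match_frac <- 0.9    # ~90% of rows should find a match

# keys that exist on both sides of the join
matching <- sample.int(n_keys, size = match_frac * n_keys)

# draw non-matching keys from the *same* range, excluding the matching
# set, so they interleave with matching values when sorted instead of
# all landing above max(matching)
non_matching <- sample(setdiff(seq_len(n_keys), matching),
                       size = (1 - match_frac) * n_keys)

rhs_key <- sample(c(matching, non_matching))  # shuffle together

# most non-matching keys now sit inside the matching range
mean(non_matching < max(matching))
```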