h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

join data could be better distributed #154

Closed jangorecki closed 3 years ago

jangorecki commented 4 years ago

Currently the non-matching rows of the RHS data (around 10%) fail to match because their values are all higher than the maximum matching value. https://github.com/h2oai/db-benchmark/blob/bef4ff937c802e5007a201d131b221ea54895132/_data/join-datagen.R#L133-L136 To improve join stress coverage, those non-matching values should be mixed in among the matching ones; otherwise algorithms like sort-merge join can easily cut off the non-matching part.
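A minimal sketch of the problem (in Python/numpy rather than the repo's R datagen; the key ranges and counts below are illustrative, not the benchmark's actual parameters): in the old scheme the non-matching RHS keys all sit above the matching range, so a sorted RHS lets a merge join stop early, whereas drawing the misses from the same range spreads them across the whole key space.

```python
import numpy as np

rng = np.random.default_rng(42)

n_match, n_nomatch = 90, 10  # ~10% non-matching, as in the issue; sizes are toy values

# Keys that exist on both sides of the join.
match_keys = rng.integers(1, 1000, size=n_match)

# Old scheme (the problem): every non-matching key is strictly above the
# matching range, so after sorting, the non-matching tail is trivially skippable.
old_nomatch = rng.integers(match_keys.max() + 1,
                           match_keys.max() + 1000, size=n_nomatch)
assert old_nomatch.min() > match_keys.max()

# Fixed scheme (the idea of the fix): draw non-matching keys from the same
# range but exclude the matching values, so misses are interleaved with hits.
pool = np.setdiff1d(np.arange(1, 1000), match_keys)
new_nomatch = rng.choice(pool, size=n_nomatch, replace=False)
assert np.intersect1d(new_nomatch, match_keys).size == 0  # still non-matching
assert (new_nomatch < match_keys.max()).any()             # but no longer a separable tail
```

With the interleaved layout, a sort-merge join has to carry the non-matching rows through the whole merge pass instead of discarding them as a contiguous suffix.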

jangorecki commented 3 years ago

Resolved by 3ff6d43029011841f6a0613231e60d89103ead96 and 235f339e381b7004f39dff55c1a96ccca37e84fd.

These are the average timing ratios (new data / old data) for each solution:

      solution   new/old
1:  data.table 0.9438163
2: pydatatable 1.2447575
3:       dplyr 0.9437487
4:      pandas 0.9835782
5:       spark 0.9408303
6:        dask 1.0567662
7:     juliadf 0.9882151
8:        cudf 0.9865482
9:  clickhouse 0.9534466

In general they look good. pydatatable was on average about 24% slower on the new join data.