h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

big to big join timings not stable #2161

Open jangorecki opened 4 years ago

jangorecki commented 4 years ago

Pydatatable joins can be very fast, but for big-to-big joins the timing variance is very large. The numeric column headers are the Unix epoch times of the benchmark runs. All timings were made on 1f81e5711b77f93494fa01379d8dd242e4b45cea. The 1e9 timings are on-disk, while the others are in-memory. Numbers are in seconds.

    in_rows               question 1572674172 1573178513 1573180283
 1:     1e7     small inner on int      0.253      0.237      0.195
 2:     1e7    medium inner on int      0.291      0.286      0.292
 3:     1e7    medium outer on int      0.099      0.105      0.100
 4:     1e7 medium inner on factor      0.355      0.329      0.354
 5:     1e7       big inner on int     12.246      4.596     11.247
 6:     1e8     small inner on int      2.051      2.009      1.982
 7:     1e8    medium inner on int      3.426      3.057      3.165
 8:     1e8    medium outer on int      1.297      1.386      1.287
 9:     1e8 medium inner on factor      4.132      4.226      4.226
10:     1e8       big inner on int     91.243     40.386     58.109
11:     1e9     small inner on int     35.511     36.573     36.716
12:     1e9    medium inner on int     44.874     40.499     45.474
13:     1e9    medium outer on int     15.163     15.463     16.067
14:     1e9 medium inner on factor    170.026    168.346    165.552
15:     1e9       big inner on int         NA         NA         NA

I don't think we have to do anything about this, because even when it is slower it is still quite fast, but I am reporting it so that it is known and documented in the project repo.

st-pasha commented 4 years ago

Hmm, looks like the biggest variation is in the "big inner on int" tests (rows 5 and 10).
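To put a number on that variation, here is a minimal sketch that computes the max/min ratio and the coefficient of variation for a few rows of the table above. The timings are copied from the table; the `spread` helper and the choice of these two metrics are illustrative assumptions, not part of the benchmark itself.

```python
from statistics import mean, stdev

# Timings in seconds copied from the table above (three benchmark runs each).
timings = {
    ("1e7", "big inner on int"): [12.246, 4.596, 11.247],
    ("1e8", "big inner on int"): [91.243, 40.386, 58.109],
    ("1e8", "medium inner on int"): [3.426, 3.057, 3.165],
    ("1e9", "medium inner on factor"): [170.026, 168.346, 165.552],
}


def spread(runs):
    """Return (max/min ratio, coefficient of variation) for a list of timings."""
    return max(runs) / min(runs), stdev(runs) / mean(runs)


# Hypothetical summary structure, keyed by (in_rows, question).
summary = {key: spread(runs) for key, runs in timings.items()}

for (rows, question), (ratio, cv) in summary.items():
    print(f"{rows:>4} {question:<24} max/min={ratio:.2f} cv={cv:.2f}")
```

For the two "big inner on int" rows the slowest run takes more than twice as long as the fastest one, while the other rows shown stay within roughly 12% of each other.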

jangorecki commented 4 years ago

Yes, it is the big-to-big join, where we join two tables of the same size and 90% of the rows match.
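A rough sketch of how such a case could be set up and timed repeatedly, using only the standard library. This is not the benchmark code and not datatable's join algorithm: `make_tables`, `inner_join`, and `time_repeated` are hypothetical helpers, the join is a plain dictionary-based hash join on unique integer keys, and the table size is kept small so it runs quickly.

```python
import random
import statistics
import time


def make_tables(n, match_fraction=0.9, seed=0):
    """Build two same-size key lists where `match_fraction` of left keys exist on the right."""
    rng = random.Random(seed)
    n_match = round(n * match_fraction)
    left = list(range(n))
    # Right side: n_match keys shared with the left, the rest outside its range.
    right = list(range(n_match)) + list(range(n, n + (n - n_match)))
    rng.shuffle(left)
    rng.shuffle(right)
    return left, right


def inner_join(left, right):
    """Pure-Python hash join on unique integer keys (a stand-in, not datatable's join)."""
    lookup = {key: pos for pos, key in enumerate(right)}
    return [(i, lookup[key]) for i, key in enumerate(left) if key in lookup]


def time_repeated(fn, repeats=5):
    """Run `fn` several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


left, right = make_tables(100_000)
matched = inner_join(left, right)
durations = time_repeated(lambda: inner_join(left, right))

print("matched fraction:", len(matched) / len(left))
print("min / median / max:", min(durations), statistics.median(durations), max(durations))
```

Reporting the minimum, median, and maximum over several repeats, rather than a single number, makes run-to-run instability of the kind described in this issue visible directly.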

jangorecki commented 3 years ago

Other join queries now also have very unstable timings, possibly caused by #2775. For example, q2 "medium inner on int": on 1e9, one time 622.36 and 687.774, another time 1592.488 and 1306.6; on 1e8, one time 152.617 and 138.237, another time 505.987 and 449.31. All were measured using the same source (b4f78fbbb7aeee1d22b56cc33f994b7b48d23765).