Closed — jangorecki closed this issue 4 years ago
The join data generation script got reworked heavily because it required too much memory. It now takes much longer to execute, but it is able to generate the big-size 1e9-row datasets. Note that it requires 150 GB of memory.
Additionally there was a minor change to the data: string columns no longer have leading zeros, so id0000001 is now id1. This was amended to align data sizes to 0.5 GB, 5 GB, and 50 GB rather than 0.6 GB, 6 GB, and 60 GB.
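The effect of dropping the zero padding can be illustrated with a small sketch (Python for illustration only; the function name and sampling scheme are hypothetical, not the actual data gen script):

```python
import random

def make_ids(n, k, zero_pad=False):
    """Generate n key strings drawn from k distinct ids.

    With zero_pad=True ids look like 'id0000001' (old format);
    without it they look like 'id1' (new format), which makes the
    string column noticeably smaller on disk and in memory.
    """
    width = len(str(k)) if zero_pad else 0
    return [f"id{random.randint(1, k):0{width}d}" for _ in range(n)]

ids = make_ids(5, 10**7)           # e.g. ['id4234781', 'id17', ...]
padded = make_ids(5, 10**7, True)  # e.g. ['id00004235', ...]
```

Because unpadded ids vary in length, their average size is smaller than the fixed-width padded form, which is where the 0.6 GB to 0.5 GB reduction comes from.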
All assumptions listed above in this issue stay valid.
Designing datasets for a join task is a fairly complex problem. This issue lists the requirements defined so far; it is also a place to discuss possible improvements.

Some of the assumptions:
- this has to be valid also when joining on multiple fields, thus the RHS table has to be partially a subset of the LHS table

What is not included:
Related commit: https://github.com/h2oai/db-benchmark/commit/1e5dc4d1a603bf5a0b10bf7abc8df7c45478c4d4

Note that the content of this post is going to be expanded.
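The multi-field assumption above (RHS keys must be a partial subset of LHS keys) can be sketched as follows; this is a minimal illustration in Python with hypothetical names, not the actual generation code:

```python
import random

def make_join_keys(n_lhs, n_rhs, k):
    """Sketch: build LHS and RHS key columns so every RHS key also
    occurs in LHS. This keeps joins on the field valid (non-empty
    matches), including when it is one of several join fields."""
    lhs_keys = [random.randint(1, k) for _ in range(n_lhs)]
    # Sample RHS keys only from values actually present in LHS,
    # so the RHS key set is a (partial) subset of the LHS key set.
    present = list(set(lhs_keys))
    rhs_keys = [random.choice(present) for _ in range(n_rhs)]
    return lhs_keys, rhs_keys

lhs, rhs = make_join_keys(1000, 100, 50)
```

Sampling RHS values from the keys observed in LHS (rather than from the full 1..k range) is one simple way to guarantee the subset property regardless of how many distinct keys the LHS draw actually produced.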