h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

join task data generation #106

Closed jangorecki closed 4 years ago

jangorecki commented 4 years ago

Designing datasets for a join task is a pretty complex problem. This issue lists the requirements defined so far. It is also a place to discuss whether anything can be improved.

some of the assumptions:

what is not included:

Related commit: https://github.com/h2oai/db-benchmark/commit/1e5dc4d1a603bf5a0b10bf7abc8df7c45478c4d4

Note that the content of this post is going to be expanded.

jangorecki commented 4 years ago

The join data generation script was reworked heavily because it required too much memory. It now takes much longer to execute, but it can generate the big 1e9-row datasets. Note that it requires 150 GB of memory. Additionally, there was a minor change to the data: string columns no longer have leading zeros, so id0000001 is now id1. This change aligns data sizes to 0.5 GB, 5 GB, and 50 GB rather than 0.6 GB, 6 GB, and 60 GB. All assumptions listed above in this issue remain valid.
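
To illustrate the leading-zeros change, here is a minimal Python sketch of the two ID formats and how many characters each one takes. It is not the benchmark's actual data generation script; the helper names `padded_id`, `plain_id`, and `total_chars`, the padding width of 7 (inferred from the `id0000001` example), and the sample of 1000 IDs are all assumptions made for illustration.

```python
def padded_id(i: int, width: int = 7) -> str:
    """Old-style ID: zero-padded to a fixed width (assumed to be 7 digits)."""
    return f"id{i:0{width}d}"


def plain_id(i: int) -> str:
    """New-style ID: no leading zeros."""
    return f"id{i}"


def total_chars(ids) -> int:
    """Total number of characters across all ID strings."""
    return sum(len(s) for s in ids)


# Hypothetical sample size, chosen only to keep the example fast.
n = 1000
padded = [padded_id(i) for i in range(1, n + 1)]
plain = [plain_id(i) for i in range(1, n + 1)]

print(padded[0], "->", plain[0])
print("padded chars:", total_chars(padded))
print("plain chars: ", total_chars(plain))
```

Padded IDs always have the same length, while unpadded IDs grow with the number of digits, so the saving per value depends on how large the IDs are. The sketch only counts characters in the strings themselves and says nothing about how the real files end up at the sizes quoted above.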