
h2oai / db-benchmark

reproducible benchmark of database-like ops

https://h2oai.github.io/db-benchmark

Mozilla Public License 2.0


join task data generation #106

Closed jangorecki closed 4 years ago

jangorecki commented 4 years ago

Designing datasets for a join task is a pretty complex problem. This issue lists the requirements defined so far. It is also a place to discuss whether anything can be improved.

some of the assumptions:

around 10% of the ID variables used in the join will not match between tables - ~~this has to be valid also when joining on multiple fields, thus the RHS has to be partially a subset of the LHS table.~~
each RHS table must contain no duplicates in the key on which it is built; the RHS tables come in three granularities: small (N/1e6), medium (N/1e3), big (N).
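A minimal sketch of the two assumptions above, not the script from the repository. The function name `make_join_tables`, its parameters, and the reading that "around 10%" applies to both sides of the join and that N/1e6, N/1e3 and N are RHS row counts are all assumptions made for illustration.

```python
import random


def make_join_tables(n, rhs_divisor, unmatched=0.1, seed=0):
    """Return (lhs_keys, rhs_keys) for one hypothetical LHS/RHS pair.

    Assumption: the RHS has n // rhs_divisor rows with unique keys, and
    about `unmatched` of the keys on each side have no partner on the other.
    """
    rng = random.Random(seed)
    n_rhs = n // rhs_divisor
    n_shared = round(n_rhs * (1 - unmatched))
    n_only = n_rhs - n_shared
    if n_shared < 1 or n_only < 1:
        raise ValueError("RHS too small to hold both matching and non-matching keys")

    shared = list(range(1, n_shared + 1))                # present on both sides
    rhs_only = list(range(n_shared + 1, n_rhs + 1))      # present only in RHS
    lhs_only = list(range(n_rhs + 1, n_rhs + n_only + 1))  # present only in LHS

    # RHS: every key exactly once, in random order (data is unsorted).
    rhs_keys = shared + rhs_only
    rng.shuffle(rhs_keys)

    # LHS: keys may repeat; a fixed share of rows uses LHS-only keys.
    n_lhs_unmatched = round(n * unmatched)
    lhs_keys = [rng.choice(shared) for _ in range(n - n_lhs_unmatched)]
    lhs_keys += [rng.choice(lhs_only) for _ in range(n_lhs_unmatched)]
    rng.shuffle(lhs_keys)
    return lhs_keys, rhs_keys


# Scaled-down example: n = 10_000 rows, RHS ten times smaller.
lhs, rhs = make_join_tables(10_000, 10)
rhs_set = set(rhs)
matched_share = sum(k in rhs_set for k in lhs) / len(lhs)
```

With the real sizes the divisor would be 1e6, 1e3 or 1; the small numbers here only keep the example quick to run.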

what is not included:

data has no NAs in either ID variables or measure variables
data is, for now, unsorted only
data being joined to has no duplicates in the ID variables, so no row explosion occurs (except for a non-equi join question)
joining on a floating-point column
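The exclusions above can be checked mechanically on a generated table. The sketch below is only an illustration under stated assumptions: rows are plain dicts, `None` stands in for NA, and the helper name `find_violations` and the column names `id1` and `v2` are hypothetical.

```python
def find_violations(rows, key):
    """List (row_index, problem) pairs for a table that is joined to.

    rows: list of dicts, one per row (hypothetical in-memory layout).
    key:  tuple of ID column names used in the join.
    """
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # No NAs in ID variables or measure variables.
        if any(v is None for v in row.values()):
            problems.append((i, "NA value"))
        k = tuple(row[c] for c in key)
        # Joining on a floating-point column is not covered.
        if any(isinstance(v, float) for v in k):
            problems.append((i, "floating-point join key"))
        # Duplicate keys would cause a row explosion in an equi join.
        if k in seen:
            problems.append((i, "duplicate key"))
        seen.add(k)
    return problems


rows = [
    {"id1": 1, "v2": 0.5},
    {"id1": 2, "v2": None},
    {"id1": 1, "v2": 1.0},
]
print(find_violations(rows, ("id1",)))
# [(1, 'NA value'), (2, 'duplicate key')]
```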

Related commit: https://github.com/h2oai/db-benchmark/commit/1e5dc4d1a603bf5a0b10bf7abc8df7c45478c4d4

Note that the content of this post is going to be expanded.

jangorecki commented 4 years ago

The join data gen script got reworked heavily because it required too much memory. It now takes much longer to execute, but it is possible to generate the big 1e9-row datasets. Note that this requires 150 GB of memory. Additionally, there was a minor change to the data. String columns do not have leading zeros anymore, so id0000001 is now id1. This was done to align data sizes to 0.5 GB, 5 GB, 50 GB rather than 0.6 GB, 6 GB, 60 GB. All assumptions listed above in this issue stay valid.
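To illustrate why dropping leading zeros shrinks the data, here is a small sketch comparing the two ID formats. The helper names and the padding width of 7 are assumptions taken only from the `id0000001` example above; the real script may pad differently per column.

```python
def padded_id(i, width=7):
    """Old style: zero-padded, e.g. id0000001 (width is an assumption)."""
    return f"id{i:0{width}d}"


def plain_id(i):
    """New style: no leading zeros, e.g. id1."""
    return f"id{i}"


def total_chars(fmt, n):
    """Total characters needed to write ids 1..n with the given formatter."""
    return sum(len(fmt(i)) for i in range(1, n + 1))


print(padded_id(1), plain_id(1))      # id0000001 id1
print(total_chars(padded_id, 1000))   # 9000
print(total_chars(plain_id, 1000))    # 4893
```

The saving in an actual file depends on how many string columns there are and how large the ID values get, so this only shows the direction of the change, not the exact 0.6 GB to 0.5 GB difference.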