h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Join Data generation script gets stuck with e9 rows #238

Closed nmshafie1993 closed 2 years ago

nmshafie1993 commented 2 years ago

Hi there, I am trying to generate data with the join data generation script. It works very well with 1e8 and 1e7 rows, but not with 1e9. It only generates a 5.77 GB file named J1_1e9_NA_0_0.csv and gets stuck at "Writing 1e9 data batch 2", after which the process eventually gets killed. Here is the output:

Generate join data of 1e9 rows
Producing keys for LHS and RHS data
Producing LHS 1e9 data from keys
Writing LHS 1e9 data J1_1e9_NA_0_0
Writing 1e9 data batch 1
Writing 1e9 data batch 2

I checked the storage: there is more than 250 GB of free space, and I cleared the cache folder. I was able to generate all the other datasets with the data generation scripts, including the groupby e9 ones. I don't understand why this one does not work. Do you have any solution for it? Thanks.

jangorecki commented 2 years ago

Hi, how is your RAM doing? What is the total amount? Could you observe memory usage during the writing process? I had to write in batches on the H2O machine to work around its limited memory. Disk space was never an issue. If RAM is the issue for you, then it is easy to just make 20 batches instead of 10.
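One way to watch memory while the script writes is to read the kernel's own counters. The sketch below is only an illustration, assuming a Linux machine where `/proc/meminfo` exists and reports a `MemAvailable` field; the function names are made up for this example and are not part of the benchmark scripts.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in KiB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Lines look like "MemAvailable:    8388608 kB"; keep the numeric part.
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_gib(text):
    """Return the MemAvailable field converted from KiB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / 1024**2


def read_available_gib(path="/proc/meminfo"):
    """Read the current value from the running system (Linux only)."""
    with open(path) as f:
        return available_gib(f.read())
```

Calling `read_available_gib()` every few seconds from a second terminal while batch 2 is being written should show whether available memory drops toward zero just before the process is killed.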

nmshafie1993 commented 2 years ago

Thank you. Is changing this line enough to make 20 batches instead of 10? Is there anything else that needs to be changed?

jangorecki commented 2 years ago

You need to adjust the subset of data as well; each iteration writes part of the data into the file. It is 2 lines below. Did you check RAM usage?
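The point about keeping the batch count and the per-iteration subset in sync can be sketched as follows. This is a hedged Python illustration of the idea, not the benchmark's actual generation script; `batch_bounds`, `write_in_batches`, `make_rows`, and the `id`/`v1` column names are hypothetical. Deriving the row ranges from the same `n_batches` value means that changing 10 to 20 cannot leave the subset out of step with the loop.

```python
import csv


def batch_bounds(n_rows, n_batches):
    """Return contiguous (start, stop) row ranges that together cover n_rows.

    The last range may be shorter, and there may be fewer than n_batches
    ranges when n_rows does not divide evenly.
    """
    size = max(1, -(-n_rows // n_batches))  # ceiling division, at least 1
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]


def write_in_batches(make_rows, n_rows, n_batches, path):
    """Write the first batch with a header, then append the remaining ones.

    make_rows(start, stop) is a hypothetical callback that yields only the
    rows of one batch, so no more than one batch is materialized at a time.
    """
    for i, (start, stop) in enumerate(batch_bounds(n_rows, n_batches)):
        with open(path, "w" if i == 0 else "a", newline="") as f:
            writer = csv.writer(f)
            if i == 0:
                writer.writerow(["id", "v1"])  # illustrative column names
            writer.writerows(make_rows(start, stop))
```

For example, `write_in_batches(lambda a, b: ((k, k * 2) for k in range(a, b)), 1000, 20, "out.csv")` writes the same file as it would with 10 batches, but each iteration handles half as many rows.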