voltrondata-labs / arrowbench

R package for benchmarking

TPC-H benchmarks should create parquet/feather files with multiple row groups #34

Open westonpace opened 3 years ago

westonpace commented 3 years ago

Right now, when generating TPC-H data, all of the data is written as one huge row group / record batch. Arrow should be able to handle that "ok", but it currently doesn't, and a single giant row group is probably not a realistic scenario anyway. Perhaps we should group the data into row groups of 1M rows each. The writers should have options to control the row group / record batch size even if the input to the writer is one huge table.