Adds the ability to specify the number of row groups when creating parquet/arrow files

jonkeane commented 2 years ago

Thanks for this, this is amazing! I have a few thoughts that I'll push to this branch, if you don't mind.

jonkeane commented 2 years ago

Ok, I just pushed a few changes that make this backwards compatible (i.e. none of our benchmarks change unless a parameter is changed explicitly).

Though as I was doing this, I started to wonder whether num_groups is actually the right approach here. All of our writers have chunk_size, which lets you say what size of chunk to make but not how many. Would it make sense to match that instead and have a chunk_size argument that percolates down to our writers? Would having more chunks than cores (but at least as many chunks as cores) be just as good as having exactly as many chunks as cores?

I guess for our immediate need of trying to optimize numbers for queries, we could hardcode chunk sizes that make the number of chunks equal the number of cores (of course, we would also want to factor in scale factor there...)
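
As a rough illustration of the arithmetic behind that idea, here is a minimal Python sketch that derives a chunk size from a row count and a core count. The function names `chunk_size_for_cores` and `n_chunks` are hypothetical and not part of arrowbench, and the row counts are made-up inputs; in practice the row count would itself depend on the scale factor.

```python
import os


def chunk_size_for_cores(n_rows, n_cores=None):
    """Return a chunk size that splits n_rows into at most n_cores chunks.

    Hypothetical helper for illustration only; not an arrowbench function.
    """
    if n_cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        n_cores = os.cpu_count() or 1
    if n_rows <= 0 or n_cores <= 0:
        raise ValueError("n_rows and n_cores must be positive")
    # Integer ceiling division: round up so no rows are left over.
    return -(-n_rows // n_cores)


def n_chunks(n_rows, chunk_size):
    """Number of chunks produced when writing n_rows with a given chunk_size."""
    return -(-n_rows // chunk_size)


# Assumed example inputs: 1,000,000 rows on an 8-core machine.
size = chunk_size_for_cores(1_000_000, 8)
print(size, n_chunks(1_000_000, size))
```

Because the chunk size is rounded up, the resulting number of chunks can be smaller than the number of cores when the row count is small (9 rows over 4 cores gives 3 chunks of 3 rows), so this only guarantees "at most one chunk per core", not exactly one.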

westonpace commented 2 years ago

You are absolutely right wrt chunk_size. I've converted this PR from num_groups to chunk_size.

voltrondata-labs/arrowbench #72