segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Control row group size #375

Closed yonesko closed 2 years ago

yonesko commented 2 years ago

Hello, I haven't found a way to control the row group size.

Yes, I can call Flush, but how do I know when a row group has reached a size limit (1GB, for example)?

achille-roussel commented 2 years ago

Hello @yonesko

There is currently no control of the row group size in bytes. Since parquet columns are encoded and compressed, I would like to ask what size you would need to control: would it be the compressed size of the row group on disk, or the total decoded size?

yonesko commented 2 years ago

The compressed size of the row group.

achille-roussel commented 2 years ago

Do you mind providing a bit more context on the use case that would require controlling the size of a row group on disk?

yonesko commented 2 years ago

We have a big (4GB) Parquet file with a single row group, and Amazon Athena fails to read it with "GENERIC_INTERNAL_ERROR: integer overflow". When we limit the row group by number of rows, the error disappears.
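For readers looking for the row-count workaround described above, here is a minimal sketch of one way to do it. It assumes a writer that exposes `Write(row interface{}) error` and `Flush() error`, and that `Flush` ends the current row group, as this thread suggests for parquet-go's `Writer`. The names `rowWriter`, `rowLimitWriter`, `newRowLimitWriter`, and `fakeWriter` are hypothetical and are not part of the library; the fake writer exists only so the example runs without a Parquet file.

```go
package main

import "fmt"

// rowWriter is a hypothetical interface with the two methods used here.
// Assumption: the parquet-go Writer satisfies it and Flush closes a row group.
type rowWriter interface {
	Write(row interface{}) error
	Flush() error
}

// rowLimitWriter flushes the wrapped writer after every maxRows rows,
// so no row group holds more than maxRows rows.
type rowLimitWriter struct {
	w       rowWriter
	maxRows int
	pending int // rows written since the last flush
}

func newRowLimitWriter(w rowWriter, maxRows int) *rowLimitWriter {
	if maxRows < 1 {
		maxRows = 1
	}
	return &rowLimitWriter{w: w, maxRows: maxRows}
}

func (l *rowLimitWriter) Write(row interface{}) error {
	if err := l.w.Write(row); err != nil {
		return err
	}
	l.pending++
	if l.pending < l.maxRows {
		return nil
	}
	l.pending = 0
	return l.w.Flush()
}

// fakeWriter stands in for a real Parquet writer and records
// how many rows ended up in each flushed group.
type fakeWriter struct {
	current int
	groups  []int
}

func (f *fakeWriter) Write(interface{}) error {
	f.current++
	return nil
}

func (f *fakeWriter) Flush() error {
	if f.current > 0 {
		f.groups = append(f.groups, f.current)
		f.current = 0
	}
	return nil
}

func main() {
	fake := &fakeWriter{}
	w := newRowLimitWriter(fake, 1000)
	for i := 0; i < 2500; i++ {
		if err := w.Write(i); err != nil {
			panic(err)
		}
	}
	// With a real writer, closing it would flush the remaining rows.
	if err := fake.Flush(); err != nil {
		panic(err)
	}
	fmt.Println(fake.groups) // prints [1000 1000 500]
}
```

This bounds the row count rather than the compressed size, so the row limit has to be chosen from an estimate of the average compressed row size for the data at hand.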