segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Specifying row group size in bytes #481

Open ty-sentio-xyz opened 1 year ago

ty-sentio-xyz commented 1 year ago

Background: Currently, there is a WriterOption that allows the user to specify how many rows are buffered before being flushed into a row group: MaxRowsPerRowGroup.
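
For context, here is a minimal sketch of how the existing option is used today (the GenericWriter API and the Row schema below are just illustrative assumptions about the caller's code, not part of this proposal):

```go
package main

import (
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

// Row is a stand-in schema for illustration only.
type Row struct {
	ID   int64  `parquet:"id"`
	Name string `parquet:"name"`
}

func main() {
	f, err := os.Create("example.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Buffer at most 1,000,000 rows per row group; the writer starts a new
	// row group once that row count is reached, regardless of how many bytes
	// those rows occupy.
	w := parquet.NewGenericWriter[Row](f, parquet.MaxRowsPerRowGroup(1_000_000))

	if _, err := w.Write([]Row{{ID: 1, Name: "a"}, {ID: 2, Name: "b"}}); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```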

In traditional big data processing systems, a properly sized row group is believed to produce a better I/O pattern (as described here: https://parquet.apache.org/docs/file-format/configurations/). This holds when a row group does not cross block boundaries (as in HDFS, for example), so that reading it requires only a single read request.

Even with object storage, which is more popular today, modern systems tend to assume a fixed row group size in bytes when a single Parquet file is processed in parallel.

One example is Apache Spark: when a job starts, the input Relation is broken down into partitions (the indivisible unit of processing), and each partition has a size in bytes, as specified here. This means that when processing a 1 GB file, there will by default be 8 partitions of 128 MB each, covering non-overlapping ranges of the file; if row groups do not also have a fixed size of 128 MB, some partitions will end up with no input data at all.

Proposal: As described in the background above, specifying the row group size in number of rows does not address this need. I would like to propose a new config parameter that lets the user specify the row group size in bytes. The new parameter works alongside the existing one: the row group is flushed once either of the two limits is reached.
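
To make the intended semantics concrete, here is a small self-contained sketch of the flush decision; the type and field names are illustrative only, not proposed API:

```go
package main

import "fmt"

// flushPolicy models the proposed behavior: a row group is flushed once
// either the row-count limit or the byte-size limit is reached.
// These names are illustrative and are not part of parquet-go's API.
type flushPolicy struct {
	maxRows  int64
	maxBytes int64
}

func (p flushPolicy) shouldFlush(bufferedRows, bufferedBytes int64) bool {
	return (p.maxRows > 0 && bufferedRows >= p.maxRows) ||
		(p.maxBytes > 0 && bufferedBytes >= p.maxBytes)
}

func main() {
	p := flushPolicy{maxRows: 1_000_000, maxBytes: 128 << 20} // 128 MiB target

	fmt.Println(p.shouldFlush(500_000, 64<<20))   // false: neither limit reached
	fmt.Println(p.shouldFlush(500_000, 128<<20))  // true: byte limit reached first
	fmt.Println(p.shouldFlush(1_000_000, 64<<20)) // true: row limit reached first
}
```

From the caller's side, this could surface as a WriterOption analogous to MaxRowsPerRowGroup (for example a hypothetical MaxRowGroupBytes option), but the exact name and plumbing can be settled in the PR.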

Let me know if this makes sense. I would love to work on this and create a PR once done.