Background
Currently, there is a WriterOption that lets the user specify how many rows are buffered before flushing into a row group: MaxRowsPerRowGroup.
In traditional big data processing systems, it is believed that a well-chosen row group size results in a better IO pattern (as stated here: https://parquet.apache.org/docs/file-format/configurations/). This holds when a row group does not straddle block boundaries (as in HDFS, for example), so that reading it requires only one read request.
Even with object storage, which is more common today, modern systems tend to assume a fixed row group size in bytes when a single Parquet file is processed in parallel.
One example is Apache Spark: when starting a job, the input Relation is broken down into partitions (the indivisible unit of processing), where each partition covers a range of bytes, as specified here. This means that if we process a 1GB file, by default there will be 8 partitions of 128MB each, covering non-overlapping ranges of that file; if the row groups do not also have a fixed size of 128MB, some of the partitions will receive no input data at all.
Proposal
As described in Background, specifying the row group size in number of rows does not address this need.
I would like to propose a new config parameter that lets the user specify the row group size in number of bytes.
The new parameter works alongside the existing one: the row group is flushed as soon as either of the two limits is satisfied.
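The combined flush condition could look like the following minimal sketch. The option name MaxBytesPerRowGroup is hypothetical (it is the parameter being proposed, not an existing API), and the buffer struct is a stand-in for the writer's internal state:

```go
package main

import "fmt"

// rowGroupBuffer is a stand-in for the writer's internal row group state.
type rowGroupBuffer struct {
	maxRows  int64 // existing limit: MaxRowsPerRowGroup
	maxBytes int64 // proposed limit (hypothetical name: MaxBytesPerRowGroup)
	numRows  int64 // rows buffered so far
	numBytes int64 // bytes buffered so far
}

// shouldFlush reports whether the row group should be flushed: it returns
// true as soon as either configured limit is reached (a zero limit means
// that limit is disabled).
func (b *rowGroupBuffer) shouldFlush() bool {
	return (b.maxRows > 0 && b.numRows >= b.maxRows) ||
		(b.maxBytes > 0 && b.numBytes >= b.maxBytes)
}

func main() {
	b := &rowGroupBuffer{maxRows: 1_000_000, maxBytes: 128 << 20}

	// Wide rows: the byte cap trips long before the row cap.
	b.numRows, b.numBytes = 10_000, 128<<20
	fmt.Println(b.shouldFlush()) // true

	// Narrow rows: neither limit reached yet, keep buffering.
	b.numRows, b.numBytes = 10_000, 1<<20
	fmt.Println(b.shouldFlush()) // false
}
```

Treating a zero value as "disabled" keeps the change backward compatible: existing code that only sets the row-count limit behaves exactly as before.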
Let me know if this makes sense. I would love to work on this and create a PR once done.