xhochy opened this issue 6 years ago
Note that the limit is currently specified as a number of rows, not in bytes. Assuming an average row size of 4 KiB, the current default of 4096 rows comes out to ~16 MiB per row group.
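For reference, a minimal sketch of how the rows-based limit is tuned today, following the `ParquetWriter.openFile` / `setRowGroupSize` calls from the parquetjs README; the schema, file name, and the 4 KiB average row size are just illustrative assumptions:

```js
const parquet = require('parquetjs');

async function writeExample() {
  const schema = new parquet.ParquetSchema({
    name:  { type: 'UTF8' },
    value: { type: 'DOUBLE' },
  });

  const writer = await parquet.ParquetWriter.openFile(schema, 'example.parquet');

  // The limit is a row count, not a byte count. With the default of 4096 rows
  // and an assumed ~4 KiB per row, one row group holds roughly 16 MiB.
  writer.setRowGroupSize(32768); // ~128 MiB under the same 4 KiB/row assumption

  await writer.appendRow({ name: 'example', value: 1.0 });
  await writer.close();
}

writeExample();
```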
I think 128 MiB in a single row group could be a tad too much for node.js, but we could definitely try increasing the default. However, just today somebody opened a bug report because they ran into an issue where the default was apparently too high for them (due to much larger rows, I believe).
I think the proper long-term solution would be to allow the user to specify the limit in bytes. However, that would require either some larger changes to the code or adding a second code path that "estimates" the record size before the actual encoding happens.
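To make the "estimate before encoding" idea a bit more concrete, here is a minimal user-level sketch, not anything the library does today: it approximates each row's size (JSON byte length as a crude stand-in for the encoded size) and keeps the rows-based limit tuned so a row group lands near a byte target. The `ByteLimitedWriter` name and the 128 MiB target are made up for illustration.

```js
// Hypothetical wrapper: approximate a byte-based row group limit on top of the
// existing rows-based setRowGroupSize() knob.
class ByteLimitedWriter {
  constructor(writer, targetBytes = 128 * 1024 * 1024) {
    this.writer = writer;          // a parquetjs ParquetWriter
    this.targetBytes = targetBytes;
    this.rows = 0;
    this.estimatedBytes = 0;
  }

  async appendRow(row) {
    // Very rough size estimate; a real implementation would look at the schema
    // and per-column encodings instead of the JSON representation.
    this.estimatedBytes += Buffer.byteLength(JSON.stringify(row));
    this.rows += 1;

    // Re-derive the row count that roughly corresponds to the byte target.
    const avgRowBytes = this.estimatedBytes / this.rows;
    this.writer.setRowGroupSize(Math.max(1, Math.round(this.targetBytes / avgRowBytes)));

    return this.writer.appendRow(row);
  }

  close() {
    return this.writer.close();
  }
}
```

Whether adjusting the row group size after rows are already buffered takes effect immediately depends on the writer's internals, so treat this purely as an illustration of the estimation step; doing it inside the library, before encoding, is what is meant above.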
From the README:
This is way off compared to the intended size of RowGroups for Parquet files. The initial implementation suggests using 128 MiB as the RowGroup size to neatly fit an HDFS block.
Using very tiny RowGroup sizes removes the main benefits of the Parquet format: being columnar, enabling vectorized execution, and offering a good trade-off between compression ratio and CPU usage by encoding the data with knowledge of its data type.
The smallest unit in a Parquet file, a page, is normally set to 1 MiB, which is already more than 200x the default RowGroup size here. Some implementations have used a 64 KiB page size, which is also larger than the default RowGroup.