ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format

RowGroup size recommendation is too low for optimal use of Parquet #61

Open xhochy opened 6 years ago

xhochy commented 6 years ago

From the README:

```js
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);
```

This is way off from the intended RowGroup size for Parquet files. The initial implementation suggests using a 128 MiB RowGroup so that it neatly fits an HDFS block.

Using very small RowGroups removes the main benefits of the Parquet format: its columnar layout, vectorized execution, and a good trade-off between compression ratio and CPU usage, achieved by encoding the data with knowledge of its type.

The smallest unit in a Parquet file, a page, is normally set to 1 MiB, which is much more than 200x the recommended RowGroup size here. Some implementations have used 64 KiB pages, which is still larger than the suggested RowGroup size.
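
As a stopgap under the current rows-based API, the row count passed to `setRowGroupSize` can simply be raised until the estimated byte size approaches the 128 MiB target. A minimal sketch, assuming an average row size of roughly 4 KiB (a placeholder figure that would have to be measured per schema):

```js
const parquet = require('parquetjs');

const schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  price: { type: 'DOUBLE' }
});

// Assumed figures, not measured: 128 MiB target per the comment above,
// ~4 KiB per row as a placeholder average for this schema.
const TARGET_ROW_GROUP_BYTES = 128 * 1024 * 1024;
const ASSUMED_AVG_ROW_BYTES = 4 * 1024;

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(Math.floor(TARGET_ROW_GROUP_BYTES / ASSUMED_AVG_ROW_BYTES)); // 32768 rows
```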

asmuth commented 6 years ago

Note that the value is currently specified not in bytes but as a number of rows. Assuming an average row size of 4 KiB, the current default row group size would come out to ~16 MiB.
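
For reference, the arithmetic behind that estimate (the ~16 MiB figure implies a 4096-row default; the 4 KiB average row size is the assumption stated above):

```js
// Back-of-the-envelope check of the figures quoted above.
const DEFAULT_ROW_GROUP_ROWS = 4096;     // implied by ~16 MiB at 4 KiB/row
const ASSUMED_AVG_ROW_BYTES = 4 * 1024;  // 4 KiB, the assumption above

const estimatedBytes = DEFAULT_ROW_GROUP_ROWS * ASSUMED_AVG_ROW_BYTES;
console.log(estimatedBytes / (1024 * 1024)); // 16 (MiB)
```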

I think 128 MiB in a single row group could be a tad too much for node.js, but we could definitely try increasing the default. However, just today somebody opened a bug report because they ran into an issue where the default was apparently too high for them (due to much larger rows, I believe).

I think the proper long-term solution would be to allow the user to specify the limit in bytes. However, that would require some larger changes to the code, or adding a second code path to "estimate" the record size before the actual encoding happens.
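
A rough sketch of what that second code path might look like; everything here is hypothetical (parquetjs exposes no byte-based limit or public row-group flush today), and the shallow size estimate merely stands in for a real pre-encoding estimate:

```js
// Hypothetical: approximate a row's serialized size without encoding it.
function estimateRowBytes(row) {
  let bytes = 0;
  for (const value of Object.values(row)) {
    if (typeof value === 'string') {
      bytes += Buffer.byteLength(value, 'utf8');
    } else if (Buffer.isBuffer(value)) {
      bytes += value.length;
    } else {
      bytes += 8; // assume numbers/booleans/timestamps fit in 8 bytes
    }
  }
  return bytes;
}

// Hypothetical: flush a row group once a byte budget is exceeded instead of
// at a fixed row count. A real version would have to live inside the writer,
// since parquetjs only flushes internally when the buffered row count is hit.
const ROW_GROUP_BYTE_LIMIT = 128 * 1024 * 1024; // 128 MiB
let bufferedBytes = 0;

async function appendRowWithByteLimit(writer, row) {
  await writer.appendRow(row);
  bufferedBytes += estimateRowBytes(row);
  if (bufferedBytes >= ROW_GROUP_BYTE_LIMIT) {
    bufferedBytes = 0;
    // ...trigger the row-group flush here (no public API for this yet)
  }
}
```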