ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format

RowGroup size recommendation is too low for optimal use of Parquet #61

Open xhochy opened 6 years ago

xhochy commented 6 years ago

From the README:

```js
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);
```

This is way off from the intended RowGroup size for Parquet files. The initial implementation suggests using a 128 MiB RowGroup so that it neatly fits an HDFS block.

Using very small RowGroups removes the main benefits of the Parquet format: its columnar layout, vectorized execution, and a good trade-off between compression ratio and CPU usage, achieved by encoding the data with knowledge of its type.

The smallest unit in a Parquet file, a page, is normally set to 1 MiB, which is much more than 200x the recommended RowGroup size here. Some implementations have used 64 KiB pages, which is still larger than the suggested RowGroup size.
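
As a stopgap under the current rows-based API, the row count passed to `setRowGroupSize` can simply be raised until the estimated byte size approaches the 128 MiB target. A minimal sketch, assuming an average row size of roughly 4 KiB (a placeholder figure that would have to be measured per schema):

```js
const parquet = require('parquetjs');

const schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  price: { type: 'DOUBLE' }
});

// Assumed figures, not measured: 128 MiB target per the comment above,
// ~4 KiB per row as a placeholder average for this schema.
const TARGET_ROW_GROUP_BYTES = 128 * 1024 * 1024;
const ASSUMED_AVG_ROW_BYTES = 4 * 1024;

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(Math.floor(TARGET_ROW_GROUP_BYTES / ASSUMED_AVG_ROW_BYTES)); // 32768 rows
```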

asmuth commented 6 years ago

Note that the value is currently specified not in bytes but as a number of rows. Assuming an average row size of 4 KiB, the current default row group size would come out to ~16 MiB.
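
For reference, the arithmetic behind that estimate (the ~16 MiB figure implies a 4096-row default; the 4 KiB average row size is the assumption stated above):

```js
// Back-of-the-envelope check of the figures quoted above.
const DEFAULT_ROW_GROUP_ROWS = 4096;     // implied by ~16 MiB at 4 KiB/row
const ASSUMED_AVG_ROW_BYTES = 4 * 1024;  // 4 KiB, the assumption above

const estimatedBytes = DEFAULT_ROW_GROUP_ROWS * ASSUMED_AVG_ROW_BYTES;
console.log(estimatedBytes / (1024 * 1024)); // 16 (MiB)
```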

I think 128 MiB in a single row group could be a tad too much for node.js, but we could definitely try increasing the default. However, just today somebody opened a bug report because they ran into an issue where the default was apparently too high for them (due to much larger rows, I believe).

I think the proper long-term solution would be to allow the user to specify the limit in bytes. However, that would require some larger changes to the code, or adding a second code path to "estimate" the record size before the actual encoding happens.
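
A rough sketch of what that second code path might look like; everything here is hypothetical (parquetjs exposes no byte-based limit or public row-group flush today), and the shallow size estimate merely stands in for a real pre-encoding estimate:

```js
// Hypothetical: approximate a row's serialized size without encoding it.
function estimateRowBytes(row) {
  let bytes = 0;
  for (const value of Object.values(row)) {
    if (typeof value === 'string') {
      bytes += Buffer.byteLength(value, 'utf8');
    } else if (Buffer.isBuffer(value)) {
      bytes += value.length;
    } else {
      bytes += 8; // assume numbers/booleans/timestamps fit in 8 bytes
    }
  }
  return bytes;
}

// Hypothetical: flush a row group once a byte budget is exceeded instead of
// at a fixed row count. A real version would have to live inside the writer,
// since parquetjs only flushes internally when the buffered row count is hit.
const ROW_GROUP_BYTE_LIMIT = 128 * 1024 * 1024; // 128 MiB
let bufferedBytes = 0;

async function appendRowWithByteLimit(writer, row) {
  await writer.appendRow(row);
  bufferedBytes += estimateRowBytes(row);
  if (bufferedBytes >= ROW_GROUP_BYTE_LIMIT) {
    bufferedBytes = 0;
    // ...trigger the row-group flush here (no public API for this yet)
  }
}
```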