ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Parquet writer should default to writing pages with DataPageV1 for better compatibility with other Parquet readers #78

Open rdmello opened 5 years ago

rdmello commented 5 years ago

Hello,

I am investigating a parquet-cpp issue where data in a file written by the parquetjs library cannot be read by the parquet-cpp library. This also seems to be related to another open issue in the parquetjs library: https://github.com/ironSource/parquetjs/issues/75.

This happens because parquetjs writes data pages using the DataPageV2 format, which does not appear to be widely supported by Parquet readers such as parquet-cpp. I have opened a pull request in parquet-cpp to improve its DataPageV2 support.

I see that there is some logic in parquetjs' writer.js to write DataPageV1 pages instead, but this is only accessible through the ParquetEnvelopeWriter API. It would be better if the ParquetWriter class could also default to writing DataPageV1 pages to improve compatibility with other Parquet readers.
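To make the request concrete, here is a minimal sketch of the behaviour I have in mind. It does not use the real parquetjs classes: `EnvelopeWriterSketch` and `WriterSketch` are hypothetical stand-ins, and the `useDataPageV2` option name is an assumption on my part rather than a confirmed part of the public API. The idea is only that the high-level writer forwards its options, and that V1 is chosen unless the caller explicitly asks for V2.

```javascript
// Hypothetical stand-in for the envelope writer; NOT the real parquetjs class.
class EnvelopeWriterSketch {
  constructor(opts = {}) {
    // Assumed option name. Default to DataPageV1 unless the caller
    // explicitly opts in to DataPageV2.
    this.useDataPageV2 = opts.useDataPageV2 === true;
  }

  // Report which page format this writer would emit.
  pageFormat() {
    return this.useDataPageV2 ? 'DataPageV2' : 'DataPageV1';
  }
}

// Hypothetical stand-in for the high-level writer. It forwards its options,
// so the V1 default applies without touching the envelope writer directly.
class WriterSketch {
  constructor(opts = {}) {
    this.envelopeWriter = new EnvelopeWriterSketch(opts);
  }
}

const defaultWriter = new WriterSketch();
const v2Writer = new WriterSketch({ useDataPageV2: true });

console.log(defaultWriter.envelopeWriter.pageFormat());
console.log(v2Writer.envelopeWriter.pageFormat());
```

With a default like this, existing callers of the high-level writer would get files that more readers can open, and anyone who needs DataPageV2 could still request it.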

ZJONSSON commented 5 years ago

See also https://github.com/ZJONSSON/parquetjs/issues/24#issuecomment-416009322

dobesv commented 4 years ago

BigQuery also does not support DataPageV2.