ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Athena query error: Required field 'compressed_page_size' was not found #99

Closed. arnabguptadev closed this issue 4 years ago

arnabguptadev commented 4 years ago

First, thanks for a wonderful library.

I mostly have this working, but when I query data generated by parquetjs from Athena, I get the following error:

HIVE_CURSOR_ERROR: can not read class parquet.format.PageHeader: Required field 'compressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:3, compressed_page_size:0)

I am creating the Athena table like this:

CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
    ... columns...
) STORED AS PARQUET
LOCATION 's3://folder/to/data'
tblproperties ("parquet.compress"="SNAPPY")

Debugging locally, I can see that it does hit the line where these headers are written, so I'm not sure where this is going wrong.

When creating the writer, I pass opts as {compression: "SNAPPY"}.

Can you please help with any pointers?

Regards, Arnab.

arnabguptadev commented 4 years ago

Sorry, I figured out what was wrong: I had a bug in a custom stream implementation that was corrupting the data.
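
The thread does not say what the stream bug actually was. As an illustration only, here is a minimal Node.js sketch of one common way a custom stream can corrupt binary output such as a Parquet file: decoding chunks to strings instead of forwarding the raw Buffer. The class names ByteSafeStream and StringRoundTripStream and the helper pipeOnce are hypothetical and are not part of parquetjs; the sample bytes are made up apart from the leading "PAR1" magic that Parquet files start with.

```javascript
const { Transform } = require('stream');

// Hypothetical byte-safe stream: forwards every chunk unchanged.
class ByteSafeStream extends Transform {
  _transform(chunk, _encoding, callback) {
    callback(null, chunk); // chunk is a Buffer; pass it through untouched
  }
}

// Hypothetical buggy stream (an assumption, not the bug from this thread):
// it round-trips each chunk through a UTF-8 string. Bytes that are not
// valid UTF-8 are replaced with U+FFFD, so the output no longer matches.
class StringRoundTripStream extends Transform {
  _transform(chunk, _encoding, callback) {
    callback(null, Buffer.from(chunk.toString('utf8'), 'utf8'));
  }
}

// Write one chunk and read back whatever the stream produced for it.
function pipeOnce(stream, input) {
  stream.write(input);
  return stream.read();
}

// "PAR1" magic followed by a few bytes that are not valid UTF-8.
const sample = Buffer.from([0x50, 0x41, 0x52, 0x31, 0xff, 0x00, 0x80]);

const safeOut = pipeOnce(new ByteSafeStream(), sample);
const buggyOut = pipeOnce(new StringRoundTripStream(), sample);

console.log(safeOut.equals(sample));  // true: bytes preserved
console.log(buggyOut.equals(sample)); // false: bytes were altered
```

Comparing the bytes that enter a custom stream with the bytes that leave it, as above, is a quick way to rule the stream in or out before suspecting the Parquet writer itself.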