ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

ENUM types unsupported #72

Open lqueryvg opened 6 years ago

lqueryvg commented 6 years ago

I'm trying to read a file which contains ENUM types (e.g. parquet-tools schema shows something like required binary entryMethod (ENUM);), but parquetjs.ParquetReader.openFile() just throws an error:

Invalid ENUM value

... which means I can't use this package to read my files.

What are my options? Are there any plans to support this type?


lqueryvg commented 6 years ago

FYI I've posted a very similar problem to... because both projects seem to suffer the same problem.

ZJONSSON commented 6 years ago

Do you have a sample parquet file that fails?

lqueryvg commented 6 years ago

@ZJONSSON, nothing that I can share right now ... Sensitive data and all that.

For now I can share more output from parquet-tools - if that might help.

creator:                parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
extra:         = avro
file schema:            <redacted>
..entryMethod:          REQUIRED BINARY O:ENUM R:1 D:1
row group 1:            RC:100000 TS:74883019 OFFSET:4
..entryMethod:           BINARY SNAPPY DO:0 FPO:39740111 SZ:227406/230832/1.02 VC:747359 ENC:RLE,PLAIN_DICTIONARY ST:[no stats for this column]

I can try to find out how the file was created. Maybe see how to rustle one up.

lqueryvg commented 6 years ago

Also FYI, when I look at the values for the ENUM columns, I can see that they are base64-encoded strings.

Quote from

ENUM

ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.

The sort order used for ENUM values is unsigned byte-wise comparison.
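As a rough illustration of what that quote implies for a JavaScript reader, here is a minimal sketch using only Node's built-in Buffer. The helper names decodeEnumValue and compareEnumValues are hypothetical and not part of parquetjs, and it assumes the ENUM values arrive base64-encoded, as observed above.

```javascript
// Hypothetical helper (not a parquetjs API): turn a base64-rendered
// ENUM value back into the UTF-8 string the Parquet spec says it represents.
function decodeEnumValue(base64Value) {
  return Buffer.from(base64Value, 'base64').toString('utf8');
}

// Hypothetical helper: order two raw ENUM values by unsigned byte-wise
// comparison, which is what Buffer.compare does for Buffers.
function compareEnumValues(a, b) {
  return Buffer.compare(a, b);
}

// Round-trip a made-up enum label to show the decoding step.
const encoded = Buffer.from('KEYED', 'utf8').toString('base64');
console.log(decodeEnumValue(encoded));

// 0x7f sorts before 0x80 because bytes are treated as unsigned.
console.log(compareEnumValues(Buffer.from([0x7f]), Buffer.from([0x80])) < 0);
```

This only covers interpreting values once they have been read; it does not address the error thrown when the file is opened.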

ZJONSSON commented 6 years ago

Thank you. Could you see if you can create a simple example that you can share? That way I can take a look and see if there is an easy fix!

kyleboyer-optum commented 5 years ago

Any updates?

lqueryvg commented 5 years ago

Hi, I have to be honest and say that I don't think I'm ever going to be able to spend the time needed to reproduce this.

I've decided to use AWS S3 Select to extract the data I need from my parquet files.
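For anyone wanting to try the same workaround, the request looks roughly like the sketch below. It only builds the parameter object; buildSelectParams is a hypothetical helper, and the bucket, key, and column names are placeholders rather than anything from my actual setup.

```javascript
// Hypothetical helper: build the parameters for an S3 Select request
// that reads one column from a Parquet object and returns JSON records.
function buildSelectParams(bucket, key, column) {
  return {
    Bucket: bucket,
    Key: key,
    ExpressionType: 'SQL',
    // Double quotes keep the column name case-sensitive.
    Expression: `SELECT s."${column}" FROM S3Object s`,
    InputSerialization: { Parquet: {} },
    OutputSerialization: { JSON: {} },
  };
}

// Placeholder values for illustration only.
const params = buildSelectParams('my-bucket', 'data/file.parquet', 'entryMethod');
console.log(params.Expression);
```

The resulting object is what you would hand to the AWS SDK's SelectObjectContent operation (selectObjectContent in SDK v2, SelectObjectContentCommand in v3); credentials and reading the response stream are left out here.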

Thanks, and sorry if this has wasted anyone's time.

kyleboyer-optum commented 5 years ago

@lqueryvg All good, I'm having the same issue right now and was wondering if @ZJONSSON or others had a chance to work with ENUMS?

Abhilash-Potharaju commented 5 years ago

Could anyone please let me know how to generate the logical type "DECIMAL"?