ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

ENUM types unsupported #72

Open lqueryvg opened 6 years ago

lqueryvg commented 6 years ago

I'm trying to read a file which contains ENUM types (e.g. parquet-tools schema shows something like required binary entryMethod (ENUM);), but parquetjs.ParquetReader.openFile() just throws an error:

Invalid ENUM value

... which means I can't use this package to read my files.

What are my options? Are there any plans to support this type?

Thanks,

lqueryvg commented 6 years ago

FYI, I've posted a very similar issue at https://github.com/kbajalc/parquets/issues/3, because both projects seem to suffer from the same problem.

ZJONSSON commented 6 years ago

Do you have a sample parquet file that fails?

lqueryvg commented 6 years ago

@ZJONSSON, nothing that I can share right now ... Sensitive data and all that.

For now I can share more output from parquet-tools - if that might help.

creator:                parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
<snip>
extra:                  writer.model.name = avro
<snip>
file schema:            <redacted>
--------------------------------------------------------------------------------
<redacted>
..entryMethod:          REQUIRED BINARY O:ENUM R:1 D:1
<snip>
row group 1:            RC:100000 TS:74883019 OFFSET:4
--------------------------------------------------------------------------------
<snip>
..entryMethod:           BINARY SNAPPY DO:0 FPO:39740111 SZ:227406/230832/1.02 VC:747359 ENC:RLE,PLAIN_DICTIONARY ST:[no stats for this column]
<snip>

I can try to find out how the file was created. Maybe see how to rustle one up.

lqueryvg commented 6 years ago

Also FYI, when I look at the values for the ENUM columns, I can see that they are base64-encoded strings.

Quote from https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#enum

ENUM

ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.

The sort order used for ENUM values is unsigned byte-wise comparison.
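
If the base64 strings are simply the raw BINARY bytes of each enum symbol, then following the spec's advice would amount to decoding them as UTF-8. Here is a minimal sketch of that idea in Node.js. The helper name decodeEnumValue and the sample symbol MANUAL are made up for illustration and are not part of parquetjs, and the assumption that the values are plain base64 of the UTF-8 bytes has not been verified against a real file.

```javascript
// Sketch only: interpret an ENUM-annotated BINARY value as UTF-8 text.
// Assumption: the value is either a Buffer or a base64 string of the raw bytes.
function decodeEnumValue(value) {
  // Buffers are decoded directly; anything else is treated as base64 text.
  const bytes = Buffer.isBuffer(value) ? value : Buffer.from(value, 'base64');
  return bytes.toString('utf8');
}

// 'TUFOVUFM' is the base64 form of the hypothetical enum symbol 'MANUAL'.
console.log(decodeEnumValue('TUFOVUFM'));
console.log(decodeEnumValue(Buffer.from('MANUAL', 'utf8')));
```

This only helps once the file can be opened at all, so it does not work around the Invalid ENUM value error itself.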

ZJONSSON commented 6 years ago

Thank you - please see if you can create a simple example that you can share? That way I can take a look and see if there is an easy fix!

kyleboyer-optum commented 5 years ago

Any updates?

lqueryvg commented 5 years ago

Hi, I have to be honest and say that I don't think I'm ever going to spend the time needed to reproduce this.

I've decided to use AWS S3 Select to extract the data I need from my parquet files.
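
For anyone considering the same workaround, the request looks roughly like the sketch below. It only builds the parameter object for the S3 SelectObjectContent operation. The helper name buildSelectParams, the bucket, the key and the column list are hypothetical placeholders, and actually sending the request (for example with the AWS SDK for JavaScript) is left out.

```javascript
// Sketch only: parameters for an S3 SelectObjectContent request on a Parquet object.
// Bucket, key and column names below are hypothetical placeholders.
function buildSelectParams(bucket, key, columns) {
  // Quote each column so mixed-case names such as entryMethod are matched exactly.
  const projection = columns.map((c) => `s."${c}"`).join(', ');
  return {
    Bucket: bucket,
    Key: key,
    ExpressionType: 'SQL',
    Expression: `SELECT ${projection} FROM S3Object s`,
    InputSerialization: { Parquet: {} },  // read the object as Parquet
    OutputSerialization: { JSON: {} },    // emit one JSON record per row
  };
}

const params = buildSelectParams('my-bucket', 'data/file.parquet', ['entryMethod']);
console.log(params.Expression);
```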

Thanks, and sorry if this has wasted anyone's time.

kyleboyer-optum commented 5 years ago

@lqueryvg All good, I'm having the same issue right now and was wondering if @ZJONSSON or others had a chance to work with ENUMS?

Abhilash-Potharaju commented 5 years ago

Could anyone please let me know how to generate the logical type "DECIMAL"?