ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

List/Map is not compatible with AWS Athena/Hive/PrestoDB #30

Open dg3feiko opened 6 years ago

dg3feiko commented 6 years ago

I generated a Parquet file with parquet.js from data containing a list and a map, but the nested fields are not readable by AWS Athena, which is based on PrestoDB. I checked other implementations, and it seems this is the reason: https://github.com/apache/parquet-mr/pull/411

Thank you for the great job all the same.

dg3feiko commented 6 years ago

For a record containing a list of elements, such as

{
  mylist:[{"foo":"abc", "bar":"abc"}, {"foo":"abc", "bar":"abc"} ]
}

parquet.js generates this schema:

message root {
  repeated group mylist {
    required binary foo (UTF8);
    required binary bar (UTF8);
  }
}

and the expected schema for PrestoDB/Hive is

message root {
  required group mylist (LIST){
    repeated group list {
       required group element {
            required binary foo (UTF8);
            required binary bar (UTF8);
       }
    }
  }
}
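
To make the difference between the two schemas concrete, here is a minimal sketch of how the example record would have to be nested to line up with the expected three-level layout (outer group, repeated `list`, `element`). The helpers `wrapList` and `unwrapList` are hypothetical and not part of parquetjs; they only reshape plain JavaScript data. Whether a given parquetjs version also writes the `(LIST)` annotation on the outer group is a separate question that this sketch does not address.

```javascript
// Hypothetical helpers (not part of parquetjs): convert between a plain
// array and the { list: [ { element: ... } ] } nesting that mirrors the
// three-level schema shown above.
function wrapList(items) {
  return { list: items.map((element) => ({ element })) };
}

function unwrapList(wrapped) {
  return wrapped.list.map((entry) => entry.element);
}

// The example record from this comment.
const record = {
  mylist: [
    { foo: 'abc', bar: 'abc' },
    { foo: 'abc', bar: 'abc' },
  ],
};

// Same data, nested as mylist -> list -> element.
const reshaped = { mylist: wrapList(record.mylist) };
console.log(JSON.stringify(reshaped, null, 2));
```

Reshaping the data this way only matches the group nesting; a reader such as Athena may still require the LIST annotation in the file metadata.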

shyim commented 6 years ago

Hey @dg3feiko, have you found a working solution for that problem?

ZJONSSON commented 6 years ago

@shyim @dg3feiko Did you check out https://github.com/ironSource/parquetjs/issues/67? It might be related.

shyim commented 6 years ago

I installed your version as mentioned in the comment, with

npm install zjonsson/parquetjs#07fb2fd8fc03bf2b57243531eaf91f2d60f5e460

I generated new files and copied them to the S3 bucket, but the Athena query still fails.

ZJONSSON commented 6 years ago

There is also https://github.com/ironSource/parquetjs/pull/43. You could try installing my fork, which has all my outstanding PRs here merged to master (including #43):

npm install zjonsson/parquetjs

shyim commented 6 years ago

With your latest fork I can select simple top-level fields, but when I select a struct, Athena crashes with the message: HIVE_CURSOR_ERROR: Can not read value at 0 in block 0

bwisitero commented 6 years ago

I used 0.8.0 to convert a flat JSON file to Parquet and verified that I can write it and read it back. I uploaded it to S3 and used Glue to create the Athena table, but I am unable to query the data; I get GENERIC_INTERNAL_ERROR: 0. Is anybody else using this converter for Athena?

justinsoliz commented 5 years ago

I gave this a try recently in AWS with Athena + Presto using the latest from zjonsson/parquetjs.

Root level primitives worked but nested lists failed:

Expected LIST column column to only have one field, but has x fields
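
The error is consistent with the standard three-level LIST layout in the Parquet format: the LIST-annotated group holds exactly one repeated child group, which in turn holds exactly one element field. Below is a rough sketch of that rule as a check over a schema tree. The node shape (`name`, `repetition`, `annotation`, `fields`) and the function `checkListGroup` are assumptions made up for illustration; they are neither the parquetjs schema API nor the actual check Athena performs.

```javascript
// Hypothetical schema-node shape: { name, repetition, annotation?, fields? }.
// Returns a list of problems; an empty array means the group follows the
// standard three-level LIST layout.
function checkListGroup(node) {
  const problems = [];
  if (node.annotation !== 'LIST') {
    problems.push(`${node.name}: group is not annotated as LIST`);
  }
  if (node.repetition === 'repeated') {
    problems.push(`${node.name}: outer group must be required or optional`);
  }
  const children = node.fields || [];
  if (children.length !== 1) {
    problems.push(`${node.name}: expected one child field, found ${children.length}`);
    return problems;
  }
  const middle = children[0];
  if (middle.repetition !== 'repeated') {
    problems.push(`${middle.name}: middle group must be repeated`);
  }
  const inner = middle.fields || [];
  if (inner.length !== 1) {
    problems.push(`${middle.name}: expected one element field, found ${inner.length}`);
  }
  return problems;
}

const foo = { name: 'foo', repetition: 'required', type: 'binary' };
const bar = { name: 'bar', repetition: 'required', type: 'binary' };

// Layout reported earlier in this thread as generated by parquet.js.
const generated = { name: 'mylist', repetition: 'repeated', fields: [foo, bar] };

// Layout reported earlier in this thread as expected by PrestoDB/Hive.
const expected = {
  name: 'mylist',
  repetition: 'required',
  annotation: 'LIST',
  fields: [{
    name: 'list',
    repetition: 'repeated',
    fields: [{ name: 'element', repetition: 'required', fields: [foo, bar] }],
  }],
};

console.log(checkListGroup(generated));
console.log(checkListGroup(expected));
```

A check like this could run against a schema description before uploading files, so that layout problems show up locally rather than as a query error.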

gbassan-br commented 5 years ago

+1. Anyone with an answer?

ZJONSSON commented 4 years ago

I encountered the same issue and spent some time getting it to work. Here is a solution that seems to work, at least for my case of lists with structs: https://github.com/ZJONSSON/parquetjs/pull/34 A test case from parquetjs to Athena can be found here: https://github.com/ZJONSSON/parquetjs/blob/9cee1592ce41e8dbca088fa2330b48ceb2d1de1a/test/list.js