Open dg3feiko opened 6 years ago
For a list of elements such as
{
  mylist:[{"foo":"abc", "bar":"abc"}, {"foo":"abc", "bar":"abc"} ]
}
parquet.js generates this schema:
message root {
  repeated group mylist {
    required binary foo (UTF8);
    required binary bar (UTF8);
  }
}
whereas the schema PrestoDB/Hive expects is:
message root {
  required group mylist (LIST) {
    repeated group list {
      required group element {
        required binary foo (UTF8);
        required binary bar (UTF8);
      }
    }
  }
}
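To make the difference between the two shapes concrete, here is a minimal sketch of nesting the data by hand so that it follows the mylist / list / element layout. The helper `toThreeLevelList` is a hypothetical name, not part of parquetjs. The `schemaDefinition` object uses the plain-object field format (`type`, `repeated`, `fields`) that parquetjs schemas are built from; passing it to `new parquet.ParquetSchema(...)` is assumed and not shown. This only reproduces the nesting. Whether the outer group also carries the `(LIST)` annotation depends on the library version, and this sketch does not address that.

```javascript
// Hypothetical helper (not a parquetjs API): wrap a plain array in the
// { list: [ { element: ... } ] } shape used by the three-level list layout.
function toThreeLevelList(items) {
  return { list: items.map((element) => ({ element })) };
}

// Schema definition mirroring the expected nesting. Assumption: this object
// would be handed to the library's schema constructor; it is not called here.
const schemaDefinition = {
  mylist: {
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              foo: { type: 'UTF8' },
              bar: { type: 'UTF8' },
            },
          },
        },
      },
    },
  },
};

// The row from the example above, reshaped to match schemaDefinition.
const row = {
  mylist: toThreeLevelList([
    { foo: 'abc', bar: 'abc' },
    { foo: 'abc', bar: 'abc' },
  ]),
};

console.log(JSON.stringify(row, null, 2));
```

The reshaping is kept separate from the schema so the same helper can be reused for any list column; the cost is that readers of the file see `mylist.list[].element.foo` rather than `mylist[].foo` unless the query engine interprets the group as a list.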
Hey @dg3feiko, have you found a working solution for that problem?
@shyim @dg3feiko Did you check out https://github.com/ironSource/parquetjs/issues/67? It might be related.
I installed your version as mentioned in the comment, with
npm install zjonsson/parquetjs#07fb2fd8fc03bf2b57243531eaf91f2d60f5e460
I generated new files and copied them to the S3 bucket, but I still have problems with the Athena query.
There is also https://github.com/ironSource/parquetjs/pull/43. You could try installing a fork that has all my outstanding PRs here merged to master (including #43):
npm install zjonsson/parquetjs
With your latest fork I can select simple fields in the first tier, but when I select a struct, Athena crashes with the message: HIVE_CURSOR_ERROR: Can not read value at 0 in block 0
I used 0.8.0 to convert a flat JSON file to Parquet and verified that I'm able to write it and read it back. I uploaded it to S3 and used Glue to create the Athena table. For some reason I'm unable to query the data, though; I get GENERIC_INTERNAL_ERROR: 0. Is anybody else using this converter for Athena?
I gave this a try recently in AWS with Athena + Presto using the latest from zjonsson/parquetjs.
Root level primitives worked but nested lists failed:
Expected LIST column column to only have one field, but has x fields
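For context on that error: in the Parquet format, a group annotated `LIST` is expected to contain exactly one field, a repeated group that in turn wraps the element. The following is a rough sketch of that rule as a check over a made-up schema tree. The node shape (`name`, `repetition`, `annotation`, `children`) and the function `checkListGroup` are assumptions for illustration only; they are not the parquetjs or Presto data structures, and the message text is not Presto's.

```javascript
// Hypothetical schema node: { name, repetition, annotation?, children? }.
// Returns a list of problems found on a LIST-annotated group (empty if none).
function checkListGroup(node) {
  if (node.annotation !== 'LIST') return [];
  const children = node.children || [];
  if (children.length !== 1) {
    return [`LIST group "${node.name}" must have exactly one field, but has ${children.length} fields`];
  }
  if (children[0].repetition !== 'repeated') {
    return [`LIST group "${node.name}" must contain a repeated field`];
  }
  return [];
}

// A LIST group holding the struct fields directly: two fields instead of one.
const flat = {
  name: 'mylist', repetition: 'required', annotation: 'LIST',
  children: [
    { name: 'foo', repetition: 'required' },
    { name: 'bar', repetition: 'required' },
  ],
};

// The three-level layout: LIST group -> repeated "list" -> "element".
const nested = {
  name: 'mylist', repetition: 'required', annotation: 'LIST',
  children: [{
    name: 'list', repetition: 'repeated',
    children: [{
      name: 'element', repetition: 'required',
      children: [
        { name: 'foo', repetition: 'required' },
        { name: 'bar', repetition: 'required' },
      ],
    }],
  }],
};

console.log(checkListGroup(flat));
console.log(checkListGroup(nested));
```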
+1 Anyone with an answer?
So I encountered the same issue and spent some time getting it to work. Here is a solution that seems to work, at least for my case of lists with structs: https://github.com/ZJONSSON/parquetjs/pull/34 A test case from parquetjs to Athena can be found here: https://github.com/ZJONSSON/parquetjs/blob/9cee1592ce41e8dbca088fa2330b48ceb2d1de1a/test/list.js
I generated a Parquet file with parquet.js from data containing a list and a map, but the nested fields are not readable by AWS Athena, which is based on PrestoDB. I checked other implementations and it seems this is the reason: https://github.com/apache/parquet-mr/pull/411
Thank you for the great job all the same.