Description of the issue: Hi, we use the PXF engine with Greenplum Database for data federation, to consume a Parquet file produced with SlingData and stored in S3. In my case, PXF can't read the Parquet file produced with SlingData because the schema name is missing. This comes from a part of the code in parquet.go.
Normally, a Parquet schema is declared with a name:
message my_schema {
REQUIRED BYTE_ARRAY element (STRING);
...
}
SlingData generates this:
message {
REQUIRED BYTE_ARRAY element (STRING);
...
}
PXF then takes the '{' token for the name of the schema and starts reading the schema, but the next token is 'required' instead of '{', so it fails (see the logs below).
Sling itself reports nothing unusual:
12:21PM INF Sling Replication [1 streams] | local:// -> S3_BUCKET
12:21PM INF [1 / 1] running stream file://./pays3.csv
12:21PM INF reading from source file system (file)
12:21PM INF writing to target file system (s3)
12:21PM INF wrote 1 rows to s3://gp-dev/pays3.parquet in 0 secs [5 r/s]
12:21PM INF execution succeeded
12:21PM INF Sling Replication Completed in 0s | local:// -> S3_BUCKET | 1 Successes | 0 Failures
But the read fails in PXF:
024-05-30 16:38:21.658 CEST ERROR [1715691517-0002894674:s3_maifvie:14 ] 3792321 --- [sponse-412] o.g.p.s.c.PxfErrorReporter : start of message: expected '{' but got 'required' at line 1: required
java.lang.IllegalArgumentException: start of message: expected '{' but got 'required' at line 1: required
at org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239) ~[parquet-column-1.11.1.jar!/:1.11.1]
at org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99) ~[parquet-column-1.11.1.jar!/:1.11.1]
at org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94) ~[parquet-column-1.11.1.jar!/:1.11.1]
at org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84) ~[parquet-column-1.11.1.jar!/:1.11.1]
at org.apache.parquet.hadoop.api.ReadSupport.getSchemaForRead(ReadSupport.java:51) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
at org.apache.parquet.hadoop.example.GroupReadSupport.init(GroupReadSupport.java:38) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
at org.apache.parquet.hadoop.api.ReadSupport.init(ReadSupport.java:85) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
at org.greenplum.pxf.plugins.hdfs.ParquetFileAccessor.readNextObject(ParquetFileAccessor.java:188) ~[pxf-hdfs-6.4.2.jar!/:?]
...
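To illustrate why the error names 'required', here is a minimal Python sketch of the header check seen in the stack trace. It is a simplified model and not the actual parquet-mr code: `tokenize` and `parse_header` are hypothetical names, and it assumes only that the parser skips whitespace, takes the token after `message` as the schema name, and then expects `{`.

```python
import re

# Punctuation characters are standalone tokens; whitespace is skipped.
_TOKEN = re.compile(r"[;{}()]|[^\s;{}()]+")


def tokenize(schema_text):
    """Split a schema string into tokens (simplified model)."""
    return _TOKEN.findall(schema_text)


def parse_header(schema_text):
    """Return the schema name, or raise if the header is malformed."""
    tokens = iter(tokenize(schema_text))

    first = next(tokens, None)
    if first != "message":
        raise ValueError(
            f"start of message: expected 'message' but got '{first}'"
        )

    # Whatever follows 'message' is taken as the name, even if it is '{'.
    name = next(tokens, None)

    brace = next(tokens, None)
    if brace != "{":
        raise ValueError(
            f"start of message: expected '{{' but got '{brace}'"
        )
    return name


if __name__ == "__main__":
    named = "message my_schema {\n  REQUIRED BYTE_ARRAY element (STRING);\n}"
    unnamed = "message {\n  REQUIRED BYTE_ARRAY element (STRING);\n}"

    print(parse_header(named))
    try:
        parse_header(unnamed)
    except ValueError as err:
        print(err)
```

With the named schema the sketch returns `my_schema`. With the unnamed one it raises an error of the same shape as the PXF log, because `{` has already been consumed as the name.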
So, could you please add a schema name to the generated Parquet file?
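As a rough sketch of the expected result at the schema-text level: `ensure_schema_name` is a hypothetical helper and `schema` is only a placeholder default name. This is not Sling's code; the real change would have to happen where Sling writes the file.

```python
import re

# Matches a 'message' header that goes straight to '{' with no name.
_UNNAMED_HEADER = re.compile(r"^(\s*message)\s*\{")


def ensure_schema_name(schema_text, default_name="schema"):
    """Insert default_name after 'message' when the header has no name.

    Schemas that already carry a name are returned unchanged.
    """
    return _UNNAMED_HEADER.sub(
        lambda m: f"{m.group(1)} {default_name} {{",
        schema_text,
        count=1,
    )


if __name__ == "__main__":
    unnamed = "message {\n  REQUIRED BYTE_ARRAY element (STRING);\n}"
    print(ensure_schema_name(unnamed))
```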
Thank you all for your work!
Sling version (sling --version): v1.2.11
Operating System (linux, mac, windows): windows, mac