slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

Schema name missing for parquet file writing #315

Closed aymhce closed 3 weeks ago

aymhce commented 3 weeks ago

Issue Description

func getParquetSchema(cols Columns) *parquet.Schema {
    return parquet.NewSchema("", NewRecNode(cols))
}

Commonly, a Parquet schema has a name:

message my_schema {
  REQUIRED BYTE_ARRAY element (STRING);
  ...
}

Sling generates this:

message {
  REQUIRED BYTE_ARRAY element (STRING);
  ...
}

PXF then takes the '{' token as the schema name and goes on to read the schema body, where it expects '{' but finds 'required', so it fails (see the logs below).

So, could you please add a schema name, like this?

func getParquetSchema(cols Columns) *parquet.Schema {
    return parquet.NewSchema("slingdata_schema", NewRecNode(cols))
}
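If the name ever needs to come from a variable rather than a literal, a small guard could make sure it is never blank. The following is only a sketch under that assumption: `schemaNameOrDefault` and `messageHeader` are hypothetical helpers, not part of Sling or of any Parquet library, and the default value is simply the name proposed above.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultSchemaName is the name proposed in this issue.
const defaultSchemaName = "slingdata_schema"

// schemaNameOrDefault (hypothetical helper) returns name unchanged unless it
// is empty or whitespace only, in which case it returns the default name.
func schemaNameOrDefault(name string) string {
	if strings.TrimSpace(name) == "" {
		return defaultSchemaName
	}
	return name
}

// messageHeader (hypothetical helper) renders the first line of the
// text-form schema, so the effect of the guard is visible.
func messageHeader(name string) string {
	return fmt.Sprintf("message %s {", schemaNameOrDefault(name))
}

func main() {
	fmt.Println(messageHeader(""))
	fmt.Println(messageHeader("my_schema"))
}
```

In `getParquetSchema`, the guarded value would be passed as the first argument to `parquet.NewSchema` in place of the empty string.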

Thank you for all the work you do for us!

source: 'local://'
target: S3_BUCKET

streams:
  'file://./pays3.csv':
    select: ["LIBELLE"]
    #object: file://./pays4.parquet
    object: s3://gp-dev/pays3.parquet
    target_options:
      compression: none
    transforms: [replace_accents]
    primary_key: ["LIBELLE"]

Sling reports nothing unusual:

12:21PM INF Sling Replication [1 streams] | local:// -> S3_BUCKET

12:21PM INF [1 / 1] running stream file://./pays3.csv
12:21PM INF reading from source file system (file)
12:21PM INF writing to target file system (s3)
12:21PM INF wrote 1 rows to s3://gp-dev/pays3.parquet in 0 secs [5 r/s]
12:21PM INF execution succeeded

12:21PM INF Sling Replication Completed in 0s | local:// -> S3_BUCKET | 1 Successes | 0 Failures

But PXF fails:

2024-05-30 16:38:21.658 CEST ERROR [1715691517-0002894674:s3_maifvie:14 ] 3792321 --- [sponse-412] o.g.p.s.c.PxfErrorReporter               : start of message: expected '{' but got 'required' at line 1:   required

java.lang.IllegalArgumentException: start of message: expected '{' but got 'required' at line 1:   required
        at org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239) ~[parquet-column-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99) ~[parquet-column-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94) ~[parquet-column-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84) ~[parquet-column-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.hadoop.api.ReadSupport.getSchemaForRead(ReadSupport.java:51) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.hadoop.example.GroupReadSupport.init(GroupReadSupport.java:38) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.hadoop.api.ReadSupport.init(ReadSupport.java:85) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) ~[parquet-hadoop-1.11.1.jar!/:1.11.1]
        at org.greenplum.pxf.plugins.hdfs.ParquetFileAccessor.readNextObject(ParquetFileAccessor.java:188) ~[pxf-hdfs-6.4.2.jar!/:?]
        ...

flarco commented 3 weeks ago

Merged #316, Thanks 👍