redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Add support for parquet logical types to parquet_encode processor #1392

Open mihaitodor opened 2 years ago

mihaitodor commented 2 years ago

In some cases, users will need to specify the logical type in the schema field. Details here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

For instance, when using type: BYTE_ARRAY to encode a string value, they might want to set the logical type to STRING so that decoders can interpret it correctly. Given this config:

input:
  generate:
    mapping: root.test = "deadbeef"
    count: 1
    interval: 0s

pipeline:
  processors:
    - parquet_encode:
        schema:
          - name: test
            type: BYTE_ARRAY

output:
  file:
    path: output.parquet
    codec: all-bytes

the pipeline produces a Parquet file whose value, when decoded with parquet-tools, appears base64-encoded:

> docker run --rm -v$(pwd):/tmp/parquet nathanhowell/parquet-tools cat /tmp/parquet/output.parquet
test = ZGVhZGJlZWY=

However, if we change this line of code to n = parquet.String(), then parquet-tools will output test = deadbeef.
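For reference, the change described above amounts to swapping the leaf-node constructor used when building the schema. A rough sketch of the difference, using constructor names from segmentio/parquet-go's public API (not the actual Benthos code):

```go
// Sketch only: the two parquet-go node constructors behave differently.
// parquet.Leaf(parquet.ByteArrayType) yields a plain BYTE_ARRAY column with
// no logical type annotation, so tools like parquet-tools fall back to
// printing the raw bytes base64-encoded. parquet.String() yields a
// BYTE_ARRAY column annotated with the STRING logical type, which decoders
// render as text.
n := parquet.Leaf(parquet.ByteArrayType) // current behaviour: raw bytes
n = parquet.String()                     // proposed: STRING logical type
```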

mihaitodor commented 2 years ago

This issue seems to also cause Pandas to fail with "OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.": https://github.com/segmentio/parquet-go/issues/325

Jeffail commented 2 years ago

I've added a UTF8 option for column values: https://github.com/benthosdev/benthos/commit/07ed81b150778a362e25e52428c59a05ca21369b as a quick workaround. Technically I think we ought to expose logical types via a separate field, but we can cross that bridge later.
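With that commit, the original example can be adjusted to emit a STRING-annotated column by declaring the column type as UTF8 instead of BYTE_ARRAY. A sketch of the relevant processor config (field name taken from the commit description above, not verified against a released build):

```yaml
pipeline:
  processors:
    - parquet_encode:
        schema:
          - name: test
            type: UTF8   # BYTE_ARRAY annotated with the STRING logical type
```

Decoding the resulting file with parquet-tools should then print the value as text rather than base64.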