segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Nested fields not readable by python lib polars #468

Open oscar6echo opened 1 year ago

oscar6echo commented 1 year ago

I want to read/write parquet files in Go so that I can also read/write them from python/polars.
It seems that nested fields ([]int in my example) written by one lib cannot be read by the other: instead of the data, empty lists are returned.

I filed an issue with polars, but it concerns segmentio/parquet-go symmetrically.
See https://github.com/pola-rs/polars/issues/6428

I am surprised that parquet compatibility can be partial.
So I am looking for a way, if possible, to create the nested field in such a way that it is understood by polars.

achille-roussel commented 1 year ago

Hello @oscar6echo, thanks for reporting!

I looked at the issue you opened at https://github.com/pola-rs/polars/issues/6428, it appears that this is due to the underlying pyarrow library not supporting the DELTA_LENGTH_BYTE_ARRAY encoding, which is used by default in parquet-go.

```sh
...
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.
```

> I am surprised that parquet compatibility can be partial.

This is somewhat of a double-edged sword with parquet: each column can have a different encoding, and clients can choose to support only a subset of the parquet spec, which can eventually result in incompatibilities like the one you describe. parquet-go actually follows the spec, which says that DELTA_LENGTH_BYTE_ARRAY should be the default encoding for byte array columns.

> Is there some other way to create the nested field that would increase compatibility with other libs like polars?

All parquet clients must support the PLAIN encoding, which is less efficient but simpler to implement, so I would recommend in your case that you force the encoding in your schema, for example by adding the plain marker to the parquet tag of Go struct fields:

```go
type Row struct {
    Idx    int    `parquet:"idx,plain"`
    Name   string `parquet:"name,plain"`
    ...
}
```

With this approach, you will be trading off storage space for a greater chance of compatibility between parquet clients.
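For example, here is a minimal end-to-end sketch of writing such rows with the reflection-based writer (file name and data are made up for illustration):

```go
package main

import (
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

// Row forces the PLAIN encoding on each column via the ",plain" tag option.
type Row struct {
	Idx  int    `parquet:"idx,plain"`
	Name string `parquet:"name,plain"`
}

func main() {
	f, err := os.Create("rows.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := parquet.NewWriter(f)
	for _, row := range []Row{{Idx: 0, Name: "Masterfog"}, {Idx: 1, Name: "Armspice"}} {
		if err := w.Write(row); err != nil {
			log.Fatal(err)
		}
	}
	// Close flushes any buffered pages and writes the parquet footer.
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```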

> Where in the source code is the nested field creation performed?

This is a broad question as it depends on how the schema is constructed, but here are a few entry points:

I hope these answers are useful to you, let me know if you have any other questions!

oscar6echo commented 1 year ago

@achille-roussel thx for the fast and comprehensive reply!

I tried the plain option to see if it would achieve compatibility, but then I realised that I had in fact gotten mixed up in my latest comment.

The files produced by parquet-go and polars in this example are indeed incompatible, due to the nested field. But your tips helped me pinpoint the difference.

1/ schema produced/read by parquet-go:

```go
type Row struct {
    Idx    int       `parquet:"idx"`
    Name   string    `parquet:"name"`
    Age    int       `parquet:"age"`
    Sex    bool      `parquet:"sex"`
    Weight float64   `parquet:"weight"`
    Time   time.Time `parquet:"time"`
    Arr    []int     `parquet:"array"`
}

schema := parquet.SchemaOf(new(Row))
fmt.Println(schema)
```


+ result:
```sh
message Row {
    required int64 idx (INT(64,true));
    required binary name (STRING);
    required int64 age (INT(64,true));
    required boolean sex;
    required double weight;
    required int64 time (TIMESTAMP(isAdjustedToUTC=true,unit=NANOS));
    repeated int64 array (INT(64,true));
}
```

2/ schema produced/read by polars:

```python
import datetime as dt
from pathlib import Path

import polars as pl
import pyarrow.parquet as pq

now = dt.datetime.now()

# note: s1 is assigned twice, so the "idx" series is overwritten by "name"
# (which is why idx does not appear in the schema below)
s1 = pl.Series("idx", [0, 1], dtype=pl.Int64)
s1 = pl.Series("name", ["Masterfog", "Armspice"], dtype=pl.Utf8)
s2 = pl.Series("age", [22, 23], dtype=pl.Int64)
s3 = pl.Series("sex", [True, False], dtype=pl.Boolean)
s4 = pl.Series("weight", [51.2, 65.3], dtype=pl.Float64)
s5 = pl.Series("time", [now, now], dtype=pl.Datetime)
s6 = pl.Series("array", [[10, 20], [11, 22]], dtype=pl.List(pl.Int64))

df = pl.DataFrame([s1, s2, s3, s4, s5, s6])
path = Path("sample3.pqt")
df.write_parquet(path)

h = pq.ParquetFile(path)
print(h.schema)
```


+ result:
```sh
<pyarrow._parquet.ParquetSchema object at 0x7fd8fb70d540>
required group field_id=-1 schema {
  optional binary field_id=-1 name (String);
  optional int64 field_id=-1 age;
  optional boolean field_id=-1 sex;
  optional double field_id=-1 weight;
  optional int64 field_id=-1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional group field_id=-1 array (List) {
    repeated group field_id=-1 list {
      optional int64 field_id=-1 item;
    }
  }
}
```
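As an aside, the same schema inspection can be done from Go, which is handy for comparing the two layouts side by side. A minimal sketch (reusing sample3.pqt from above):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

func main() {
	// open the file written by polars and print its schema as seen by parquet-go
	osf, err := os.Open("sample3.pqt")
	if err != nil {
		log.Fatal(err)
	}
	defer osf.Close()

	stat, err := osf.Stat()
	if err != nil {
		log.Fatal(err)
	}

	pf, err := parquet.OpenFile(osf, stat.Size())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(pf.Schema())
}
```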


Inspecting the schemas produced by each lib shows the difference in parquet format. There is a `group` notion in the polars-generated parquet that does not exist in the parquet-go one:

```sh
// parquet-go
repeated int64 array (INT(64,true));
```

vs.

```sh
// polars
optional group field_id=-1 array (List) {
    repeated group field_id=-1 list {
        optional int64 field_id=-1 item;
    }
}
```



So if I could create the polars format with parquet-go (possibly using intermediate structs?) then it should achieve compatibility.  
Does that make sense?  
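To make the idea concrete, this is the kind of intermediate structs I have in mind to mirror the polars three-level layout (a rough, untested sketch; names chosen to match the schema above):

```go
// hypothetical structs mirroring
// optional group array (List) { repeated group list { optional int64 item; } }
type item struct {
	Item *int64 `parquet:"item,optional"`
}

type itemList struct {
	List []item `parquet:"list"`
}

type Row struct {
	// ... other fields as before ...
	// note: this mirrors the group nesting, but may not carry
	// the (List) logical type annotation that polars emits
	Arr *itemList `parquet:"array,optional"`
}
```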

I made an attempt along these lines to mirror the parquet structure of polars, but it does not seem to work immediately. See [trial.go](https://github.com/oscar6echo/parquet-go-explo/blob/main/io/trial.go).

Is that possible?  
I recognize the polars nested structure seems a bit convoluted, but it probably makes sense in their context.  
Otherwise I should probably try a lower-level lib. In that case, any recommendation?

I would be grateful for any hint, as I think the ability to interact with a polars dataframe from outside its ecosystem is quite interesting, particularly from "Go-land", since Go is quite complementary with Python.