segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Nested fields not readable by python lib polars #468

Open oscar6echo opened 1 year ago

oscar6echo commented 1 year ago

I want to read/write parquet files in Go so that I can also read/write them from python/polars.
It seems that nested fields ([]int in my example) written by one lib cannot be read by the other: instead of the data, empty lists are returned.

I filed an issue with polars, but it concerns segmentio/parquet-go symmetrically.
See https://github.com/pola-rs/polars/issues/6428

I am surprised that parquet compatibility can be partial.
So I am looking for a way, if possible, to create the nested field in such a way that it is understood by polars.

achille-roussel commented 1 year ago

Hello @oscar6echo, thanks for reporting!

I looked at the issue you opened at https://github.com/pola-rs/polars/issues/6428, it appears that this is due to the underlying pyarrow library not supporting the DELTA_LENGTH_BYTE_ARRAY encoding, which is used by default in parquet-go.

```sh
...
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.
```

> I am surprised that parquet compatibility can be partial.

This is somewhat of a double-edged sword with parquet: each column can have a different encoding, and clients can choose to support only a subset of the parquet spec, which can eventually result in incompatibilities like the one you describe. parquet-go actually follows the spec, which says that DELTA_LENGTH_BYTE_ARRAY should be the default encoding for byte array columns.

> Is there some other way to create the nested field that would increase compatibility with other libs like polars?

All parquet clients must support the PLAIN encoding, which is less efficient but simpler to implement, so I would recommend in your case that you force the encoding in your schema, for example by adding the plain marker to the parquet tag of Go struct fields:

```go
type Row struct {
    Idx    int    `parquet:"idx,plain"`
    Name   string `parquet:"name,plain"`
    ...
}
```

With this approach, you will be trading off storage space for a greater chance of compatibility between parquet clients.
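For example, here is a minimal end-to-end sketch of writing such rows with the reflection-based writer (file name and data are made up for illustration):

```go
package main

import (
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

// Row forces the PLAIN encoding on each column via the ",plain" tag option.
type Row struct {
	Idx  int    `parquet:"idx,plain"`
	Name string `parquet:"name,plain"`
}

func main() {
	f, err := os.Create("rows.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := parquet.NewWriter(f)
	for _, row := range []Row{{Idx: 0, Name: "Masterfog"}, {Idx: 1, Name: "Armspice"}} {
		if err := w.Write(row); err != nil {
			log.Fatal(err)
		}
	}
	// Close flushes any buffered pages and writes the parquet footer.
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```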

> Where in the source code is the nested field creation performed?

This is a broad question as it depends on how the schema is constructed, but here are a few entry points:

I hope these answers are useful to you, let me know if you have any other questions!

oscar6echo commented 1 year ago

@achille-roussel thx for the fast and comprehensive reply!

I tried the plain option to see if it would achieve compatibility, but then I realised that I had in fact gotten mixed up in my latest comment.

The files produced by parquet-go and polars in this example are indeed incompatible, due to the nested field. But your tips helped me pinpoint the difference.

1/ schema produced/read by parquet-go:

```go
type Row struct {
    Idx    int       `parquet:"idx"`
    Name   string    `parquet:"name"`
    Age    int       `parquet:"age"`
    Sex    bool      `parquet:"sex"`
    Weight float64   `parquet:"weight"`
    Time   time.Time `parquet:"time"`
    Arr    []int     `parquet:"array"`
}

schema := parquet.SchemaOf(new(Row))
fmt.Println(schema)
```


+ result:
```sh
message Row {
    required int64 idx (INT(64,true));
    required binary name (STRING);
    required int64 age (INT(64,true));
    required boolean sex;
    required double weight;
    required int64 time (TIMESTAMP(isAdjustedToUTC=true,unit=NANOS));
    repeated int64 array (INT(64,true));
}
```

2/ schema produced/read by polars:

```python
import datetime as dt
from pathlib import Path

import polars as pl
import pyarrow.parquet as pq

now = dt.datetime.now()

# note: s1 is assigned twice, so the "idx" series is overwritten by "name"
# (which is why idx does not appear in the schema below)
s1 = pl.Series("idx", [0, 1], dtype=pl.Int64)
s1 = pl.Series("name", ["Masterfog", "Armspice"], dtype=pl.Utf8)
s2 = pl.Series("age", [22, 23], dtype=pl.Int64)
s3 = pl.Series("sex", [True, False], dtype=pl.Boolean)
s4 = pl.Series("weight", [51.2, 65.3], dtype=pl.Float64)
s5 = pl.Series("time", [now, now], dtype=pl.Datetime)
s6 = pl.Series("array", [[10, 20], [11, 22]], dtype=pl.List(pl.Int64))

df = pl.DataFrame([s1, s2, s3, s4, s5, s6])
path = Path("sample3.pqt")
df.write_parquet(path)

h = pq.ParquetFile(path)
print(h.schema)
```


+ result:
```sh
<pyarrow._parquet.ParquetSchema object at 0x7fd8fb70d540>
required group field_id=-1 schema {
  optional binary field_id=-1 name (String);
  optional int64 field_id=-1 age;
  optional boolean field_id=-1 sex;
  optional double field_id=-1 weight;
  optional int64 field_id=-1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional group field_id=-1 array (List) {
    repeated group field_id=-1 list {
      optional int64 field_id=-1 item;
    }
  }
}
```
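As an aside, the same schema inspection can be done from Go, which is handy for comparing the two layouts side by side. A minimal sketch (reusing sample3.pqt from above):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

func main() {
	// open the file written by polars and print its schema as seen by parquet-go
	osf, err := os.Open("sample3.pqt")
	if err != nil {
		log.Fatal(err)
	}
	defer osf.Close()

	stat, err := osf.Stat()
	if err != nil {
		log.Fatal(err)
	}

	pf, err := parquet.OpenFile(osf, stat.Size())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(pf.Schema())
}
```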


Inspecting the schemas produced by each lib shows the difference in parquet format. There is a `group` notion in the polars-generated parquet that does not exist in the parquet-go one:

```sh
// parquet-go
repeated int64 array (INT(64,true));
```

vs.

```sh
// polars
optional group field_id=-1 array (List) {
    repeated group field_id=-1 list {
        optional int64 field_id=-1 item;
    }
}
```



So if I could create the polars format with parquet-go (possibly using intermediate structs?) then it should achieve compatibility.  
Does that make sense?  
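To make the idea concrete, this is the kind of intermediate structs I have in mind to mirror the polars three-level layout (a rough, untested sketch; names chosen to match the schema above):

```go
// hypothetical structs mirroring
// optional group array (List) { repeated group list { optional int64 item; } }
type item struct {
	Item *int64 `parquet:"item,optional"`
}

type itemList struct {
	List []item `parquet:"list"`
}

type Row struct {
	// ... other fields as before ...
	// note: this mirrors the group nesting, but may not carry
	// the (List) logical type annotation that polars emits
	Arr *itemList `parquet:"array,optional"`
}
```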

I made an attempt along these lines to mirror the parquet structure of polars, but it does not seem to work immediately. See [trial.go](https://github.com/oscar6echo/parquet-go-explo/blob/main/io/trial.go).

Is that possible?  
I recognize the polars nested structure seems a bit convoluted, but it probably makes sense in their context.  
Otherwise I should probably try a lower-level lib. In that case, any recommendation?

I would be grateful for any hint, as I think the ability to interact with a polars dataframe from outside its ecosystem is quite interesting, particularly from "Go-land", since Go is quite complementary with Python.