Extracting encodings in file schema metadata

thorfour commented 1 year ago

We recently came across some unexpected behavior when parsing written file schemas. Some of the fields returned the encoding that the columns were written with, while others returned PLAIN even though they were written with RLE_DICTIONARY.

Digging through the code, we can see that those columns are given multiple encodings, which are then sorted: https://github.com/segmentio/parquet-go/blob/dd8318a577a976ef977b9796b10a6c07c61c6bf5/writer.go#L346-L356

However, when reading the encodings back into the top-level file schema, we only select the first (sorted) encoding for the column chunk: https://github.com/segmentio/parquet-go/blob/dd8318a577a976ef977b9796b10a6c07c61c6bf5/column.go#L334-L336

This is what causes the behavior where the schema written doesn't match the schema read.
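To illustrate the mismatch, here is a minimal, self-contained sketch. It does not use the parquet-go API: `sortedEncodings` and `firstEncoding` are hypothetical stand-ins for the writer and reader code linked above, and it assumes the sort is by the numeric enum value from the Parquet Thrift definition (PLAIN = 0, RLE_DICTIONARY = 8) and that a dictionary-encoded column chunk also lists PLAIN.

```go
package main

import (
	"fmt"
	"sort"
)

// Encoding mirrors the numeric values of the Encoding enum in the Parquet
// Thrift definition. Only the two values relevant here are listed.
type Encoding int

const (
	Plain         Encoding = 0
	RLEDictionary Encoding = 8
)

func (e Encoding) String() string {
	switch e {
	case Plain:
		return "PLAIN"
	case RLEDictionary:
		return "RLE_DICTIONARY"
	default:
		return fmt.Sprintf("Encoding(%d)", int(e))
	}
}

// sortedEncodings is a hypothetical stand-in for the writer side: it returns
// a copy of the column chunk's encodings sorted by numeric value (assumption).
func sortedEncodings(encodings []Encoding) []Encoding {
	out := append([]Encoding(nil), encodings...)
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// firstEncoding is a hypothetical stand-in for the reader side: it reports
// only the first entry of the stored list as the column's encoding.
func firstEncoding(encodings []Encoding) Encoding {
	return encodings[0]
}

func main() {
	// The column was written with RLE_DICTIONARY, but PLAIN is also listed.
	written := []Encoding{RLEDictionary, Plain}
	stored := sortedEncodings(written)

	fmt.Println("stored:  ", stored)
	fmt.Println("reported:", firstEncoding(stored))
}
```

Under these assumptions PLAIN sorts ahead of RLE_DICTIONARY, so the reader reports PLAIN for a column that was written with RLE_DICTIONARY.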

Would it be possible to not strictly sort the encodings and always have the provided encoding be first in the list? Or would it be possible for the top-level schemas to contain a slice of encodings instead of a single one? Or maybe unravel the logic that creates the encodings during writes to derive the provided encoding. Basically, some way to have the read and written schemas match.
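As a rough sketch of the first option (not necessarily what the library ended up doing), the list could keep the provided encoding at index 0 and sort only the remaining entries. `providedFirst` is a hypothetical helper, not part of parquet-go, and the numeric values again follow the Parquet Thrift enum.

```go
package main

import (
	"fmt"
	"sort"
)

// Encoding uses the numeric values from the Parquet Thrift Encoding enum.
type Encoding int

const (
	Plain         Encoding = 0
	RLE           Encoding = 3
	RLEDictionary Encoding = 8
)

// providedFirst (hypothetical) returns a list with the provided encoding at
// index 0, followed by the other distinct encodings in ascending order.
func providedFirst(provided Encoding, used []Encoding) []Encoding {
	out := []Encoding{provided}
	seen := map[Encoding]bool{provided: true}
	for _, e := range used {
		if !seen[e] {
			seen[e] = true
			out = append(out, e)
		}
	}
	// Sort everything after the first element, leaving index 0 in place.
	rest := out[1:]
	sort.Slice(rest, func(i, j int) bool { return rest[i] < rest[j] })
	return out
}

func main() {
	list := providedFirst(RLEDictionary, []Encoding{RLE, Plain, RLEDictionary})
	fmt.Println(list[0] == RLEDictionary) // the reader's "first encoding" now matches
}
```

With this ordering, a reader that takes only the first entry would see the encoding the column was configured with, while the rest of the list stays deterministic.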

abraithwaite commented 1 year ago

I'm not familiar enough with the spec to say whether or not this would break anything, but what you're saying sounds reasonable enough.

According to the git blame, it may be related to parquet merges, since that is when sortPageEncoding was added:

https://github.com/segmentio/parquet-go/pull/45

achille-roussel commented 1 year ago

Fixed by #356

Extracting encodings in file schema metadata #355