Closed thorfour closed 1 year ago
I'm not familiar enough with the spec to say whether or not this would break anything, but what you're saying sounds reasonable enough.
It may be related to parquet merges, according to the git blame, when the sortPageEncoding was added:
Fixed by #356
We recently came across some unexpected behavior when parsing written file schemas. We noticed that some of the fields returned the encoding that the columns were written with, while others returned `PLAIN` even though they were written with `RLE_DICTIONARY`.
Digging through the code, we can see that those columns are given multiple encodings, which are then sorted: https://github.com/segmentio/parquet-go/blob/dd8318a577a976ef977b9796b10a6c07c61c6bf5/writer.go#L346-L356
However, when reading the encodings back into the top-level file schema, we only select the first (sorted) encoding for the column chunk: https://github.com/segmentio/parquet-go/blob/dd8318a577a976ef977b9796b10a6c07c61c6bf5/column.go#L334-L336
This is what's causing the behavior, where the schema written doesn't match the schema read.
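To make the mismatch concrete, here is a minimal standalone sketch of what we think is happening. It does not use parquet-go itself: `writeEncodings` and `readEncoding` are hypothetical stand-ins for the linked writer and column code, and the assumption that a dictionary-encoded column also records `PLAIN` and `RLE` is ours. The numeric values are the encoding IDs from the Parquet format definition.

```go
package main

import (
	"fmt"
	"sort"
)

// Encoding mirrors the numeric encoding IDs of the Parquet format.
type Encoding int32

const (
	Plain         Encoding = 0
	RLE           Encoding = 3
	RLEDictionary Encoding = 8
)

func (e Encoding) String() string {
	switch e {
	case Plain:
		return "PLAIN"
	case RLE:
		return "RLE"
	case RLEDictionary:
		return "RLE_DICTIONARY"
	}
	return fmt.Sprintf("Encoding(%d)", int32(e))
}

// writeEncodings is a hypothetical stand-in for the writer: it collects
// the encodings recorded for a column chunk and sorts them by numeric ID.
func writeEncodings(requested Encoding) []Encoding {
	encodings := []Encoding{requested}
	if requested == RLEDictionary {
		// Assumption: the dictionary page and the levels add these.
		encodings = append(encodings, Plain, RLE)
	}
	sort.Slice(encodings, func(i, j int) bool { return encodings[i] < encodings[j] })
	return encodings
}

// readEncoding is a hypothetical stand-in for the reader: it keeps only
// the first encoding in the list.
func readEncoding(encodings []Encoding) Encoding {
	return encodings[0]
}

func main() {
	written := writeEncodings(RLEDictionary)
	fmt.Println("recorded:", written)
	fmt.Println("read back:", readEncoding(written))
}
```

Because `PLAIN` has the lowest ID, it sorts to the front and is what the reader reports, even though the column was written with `RLE_DICTIONARY`.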
Would it be possible to not strictly sort the encodings and always have the provided encoding be first in the list? Alternatively, could the top-level schemas contain a slice of encodings instead of just a single one? Or maybe the logic that creates the encodings during writes could be unraveled to derive the provided encoding. Basically, we're looking for some way to have the read and written schemas match.
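As a rough illustration of the first option, the ordering could look something like the sketch below. `orderEncodings` is a hypothetical helper, not an existing parquet-go function, and we have not checked whether anything relies on the list being fully sorted.

```go
package main

import (
	"fmt"
	"sort"
)

// Encoding mirrors the numeric encoding IDs of the Parquet format.
type Encoding int32

const (
	Plain         Encoding = 0
	RLE           Encoding = 3
	RLEDictionary Encoding = 8
)

// orderEncodings (hypothetical) places the requested encoding first and
// sorts the remaining encodings by numeric ID. Any repeat of the
// requested encoding in used is dropped.
func orderEncodings(requested Encoding, used []Encoding) []Encoding {
	out := make([]Encoding, 0, len(used)+1)
	out = append(out, requested)
	for _, e := range used {
		if e != requested {
			out = append(out, e)
		}
	}
	rest := out[1:] // shares storage with out, so sorting rest reorders out
	sort.Slice(rest, func(i, j int) bool { return rest[i] < rest[j] })
	return out
}

func main() {
	ordered := orderEncodings(RLEDictionary, []Encoding{RLE, Plain, RLEDictionary})
	// A reader that takes the first entry now sees the requested encoding.
	fmt.Println(ordered[0] == RLEDictionary)
}
```

With this ordering, a reader that only looks at the first entry would get back the encoding that was provided at write time, and the rest of the list stays deterministic.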