The Parquet batch reader contains many dictionary ID value decoders. They usually allocate a buffer on every readNext call, which hurts both reliability and performance. There is no need to create a separate decoder for each column type, each one adding unnecessary memory allocations and memory copies. It would be nice to send a new PR that unifies the existing RLE dictionary decoders. After all, dictionary IDs can only be RLE/BP encoded, and that encoding does not depend on the data column type.
Ref: https://parquet.apache.org/docs/file-format/data-pages/encodings/
"Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width)."
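To illustrate the idea, here is a minimal sketch of a single type-agnostic dictionary ID decoder that follows the data page format quoted above and writes directly into a caller-supplied int[] instead of allocating a buffer per call. The class name DictionaryIdDecoder and the readNext(int[], int, int) signature are hypothetical and are not taken from the Presto code base. The sketch assumes the whole page is already in a byte[] and that every bit-packed run is padded to full groups of 8 values.

```java
import java.util.Arrays;

// Hypothetical sketch, not Presto's actual decoder.
final class DictionaryIdDecoder {
    private final byte[] input;
    private final int bitWidth;
    private int pos;         // index of the next unread run header
    private int rleLeft;     // values left in the current RLE run
    private int rleValue;    // repeated value of the current RLE run
    private int packedLeft;  // values left in the current bit-packed run
    private long bitPos;     // absolute bit offset of the next bit-packed value

    DictionaryIdDecoder(byte[] page) {
        this.input = page;
        this.bitWidth = page[0] & 0xFF;  // first byte of the page: bit width of the IDs
        if (bitWidth > 32) {
            throw new IllegalArgumentException("bit width must be <= 32: " + bitWidth);
        }
        this.pos = 1;
    }

    // Decodes `length` dictionary IDs into out[offset..offset+length) without allocating.
    void readNext(int[] out, int offset, int length) {
        int end = offset + length;
        while (offset < end) {
            if (rleLeft == 0 && packedLeft == 0) {
                readRunHeader();
            }
            if (rleLeft > 0) {
                int n = Math.min(rleLeft, end - offset);
                Arrays.fill(out, offset, offset + n, rleValue);
                rleLeft -= n;
                offset += n;
            } else {
                int n = Math.min(packedLeft, end - offset);
                for (int i = 0; i < n; i++) {
                    out[offset++] = readPacked();
                }
                packedLeft -= n;
            }
        }
    }

    private void readRunHeader() {
        // The run header is an unsigned LEB128 varint.
        int header = 0;
        int shift = 0;
        int b;
        do {
            b = input[pos++] & 0xFF;
            header |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        int count = header >>> 1;
        if ((header & 1) == 0) {
            // RLE run: `count` copies of one value stored in ceil(bitWidth / 8) little-endian bytes.
            int bytes = (bitWidth + 7) / 8;
            int value = 0;
            for (int i = 0; i < bytes; i++) {
                value |= (input[pos++] & 0xFF) << (8 * i);
            }
            rleValue = value;
            rleLeft = count;
        } else {
            // Bit-packed run: `count` groups of 8 values, each group occupying bitWidth bytes.
            packedLeft = count * 8;
            bitPos = 8L * pos;
            pos += count * bitWidth;
        }
    }

    private int readPacked() {
        // Values are packed starting from the least significant bit of each byte.
        int byteIndex = (int) (bitPos >>> 3);
        int shift = (int) (bitPos & 7);
        int needed = (shift + bitWidth + 7) / 8;  // at most 5 bytes for a 32-bit value
        long bits = 0;
        for (int i = 0; i < needed; i++) {
            bits |= (long) (input[byteIndex + i] & 0xFF) << (8 * i);
        }
        bitPos += bitWidth;
        return (int) ((bits >>> shift) & ((1L << bitWidth) - 1));
    }
}
```

Because the decoder only produces int IDs, each column-type-specific reader could share it and do its own dictionary lookup afterwards, reusing one int[] across readNext calls. Decoding bit-packed values one at a time keeps the sketch short; a production version would likely unpack whole groups of 8 at once.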
See https://github.com/prestodb/presto/pull/23584