prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Unify the Parquet dictionary value decoders #23612

Open yingsu00 opened 2 weeks ago

yingsu00 commented 2 weeks ago

The Parquet batch reader has many dictionary ID value decoders. They usually allocate a new buffer in every readNext call, which is bad for both reliability and performance. There is no need to create separate decoders that add unnecessary memory allocations and memory copies. It would be nice to send a new PR to unify the existing RLE dictionary decoders. After all, dictionary IDs can only be RLE/BP encoded, and their encoding does not depend on the data column type.

Ref: https://parquet.apache.org/docs/file-format/data-pages/encodings/ "Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width)."
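To make the idea concrete, here is a rough sketch of what a single, type-independent dictionary ID decoder could look like. It is not the existing Presto code: the class name `DictionaryIdDecoder` and its `readNext(int[], int, int)` signature are hypothetical, and it assumes the whole page payload (the bit-width byte followed by the RLE/bit-packed hybrid runs) is already in a `byte[]`. The point is only that the decoder writes into a caller-owned `int[]`, so no buffer is allocated per call, and nothing in it depends on the column type.

```java
import java.util.Arrays;

// Hypothetical sketch, not the Presto implementation.
final class DictionaryIdDecoder {
    private final byte[] page;   // 1 byte bit width, then RLE/bit-packed runs
    private final int bitWidth;
    private int pos;             // byte offset of the next run header
    private int rleLeft;         // values remaining in the current RLE run
    private int rleValue;        // repeated value of the current RLE run
    private int packedLeft;      // values remaining in the current bit-packed run
    private long bitPos;         // absolute bit offset of the next packed value

    DictionaryIdDecoder(byte[] page) {
        this.page = page;
        this.bitWidth = page[0] & 0xFF;
        this.pos = 1;
    }

    // Decodes `count` IDs into ids[offset..offset+count) without allocating.
    void readNext(int[] ids, int offset, int count) {
        while (count > 0) {
            if (rleLeft == 0 && packedLeft == 0) {
                readRunHeader();
            }
            if (rleLeft > 0) {
                int n = Math.min(count, rleLeft);
                Arrays.fill(ids, offset, offset + n, rleValue);
                rleLeft -= n;
                offset += n;
                count -= n;
            } else {
                int n = Math.min(count, packedLeft);
                for (int i = 0; i < n; i++) {
                    ids[offset + i] = readPackedValue();
                }
                packedLeft -= n;
                offset += n;
                count -= n;
            }
        }
    }

    private void readRunHeader() {
        // The run header is a ULEB128 varint; its low bit selects the run type.
        int header = 0;
        int shift = 0;
        int b;
        do {
            b = page[pos++] & 0xFF;
            header |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        if ((header & 1) == 0) {
            // RLE run: repeat count, then the value in ceil(bitWidth / 8) bytes.
            rleLeft = header >>> 1;
            rleValue = 0;
            int valueBytes = (bitWidth + 7) / 8;
            for (int i = 0; i < valueBytes; i++) {
                rleValue |= (page[pos++] & 0xFF) << (8 * i);
            }
        } else {
            // Bit-packed run: groups of 8 values, bitWidth bytes per group.
            int groups = header >>> 1;
            packedLeft = groups * 8;
            bitPos = 8L * pos;
            pos += groups * bitWidth;   // next header starts after this run
        }
    }

    // Reads one value bit by bit, least significant bit first. A real decoder
    // would unpack 8 values at a time; this keeps the sketch short.
    private int readPackedValue() {
        int value = 0;
        for (int i = 0; i < bitWidth; i++) {
            int bit = (page[(int) (bitPos >>> 3)] >>> (int) (bitPos & 7)) & 1;
            value |= bit << i;
            bitPos++;
        }
        return value;
    }
}
```

The caller owns and reuses the `ids` array across calls, and the same decoder can then feed any typed dictionary lookup (int, long, binary, and so on), since the lookup is the only type-specific step.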

See https://github.com/prestodb/presto/pull/23584

yingsu00 commented 2 weeks ago

cc @ethanyzhang