trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Serde improvements #14227

Open sopel39 opened 2 years ago

sopel39 commented 2 years ago

DictionaryBlockEncoding:

VariableWidthBlockEncoding

sopel39 commented 2 years ago

cc @lukasz-stec

leeyh0216 commented 1 year ago

Hi, @sopel39.

I have some questions about this issue. Please bear with me if they are basic.

  1. (DictionaryBlockEncoding) In ORC and Parquet, the spec defines the elements that make up ids as unsigned integers. Will there be a problem if they are changed to short or byte?
  2. (DictionaryBlockEncoding) Even if they are changed to short or byte, wouldn't deserialization performance decrease, because 2-byte padding must be inserted into the slice of short/byte elements during deserialization?

I am interested in this issue and want to understand the exact context, which is why I am asking.

sopel39 commented 1 year ago

(DictionaryBlockEncoding) In ORC and Parquet, the spec defines the elements that make up ids as unsigned integers. Will there be a problem if they are changed to short or byte?

This problem is unrelated to either ORC or Parquet.

(DictionaryBlockEncoding) Even if they are changed to short or byte, wouldn't deserialization performance decrease, because 2-byte padding must be inserted into the slice of short/byte elements during deserialization?

It's more about reducing the size of the payload. A smaller payload means less processing along the way, which is a win even if CPU usage stays the same.
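
To make the payload argument concrete, here is a minimal sketch of width-adaptive id serialization. It is not Trino's `DictionaryBlockEncoding`; the class `NarrowIdCodec`, its method names, and the one-byte width header are all hypothetical, chosen only for illustration. It assumes ids are non-negative positions into a dictionary of known size, so they can be written as 1, 2, or 4 bytes and widened back to `int` on read by masking, with no padding inserted into the serialized bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Trino: writes dictionary ids using the
// narrowest width that can address every dictionary position.
final class NarrowIdCodec {
    private NarrowIdCodec() {}

    // Bytes needed per id when valid ids are 0 .. dictionarySize - 1.
    static int idWidth(int dictionarySize) {
        if (dictionarySize <= (1 << 8)) {
            return 1;
        }
        if (dictionarySize <= (1 << 16)) {
            return 2;
        }
        return 4;
    }

    // Assumed layout: [width: 1 byte][count: 4 bytes][ids: count * width bytes].
    static byte[] encode(int[] ids, int dictionarySize) {
        int width = idWidth(dictionarySize);
        ByteBuffer out = ByteBuffer.allocate(1 + Integer.BYTES + ids.length * width);
        out.put((byte) width);
        out.putInt(ids.length);
        for (int id : ids) {
            if (id < 0 || id >= dictionarySize) {
                throw new IllegalArgumentException("id out of range: " + id);
            }
            if (width == 1) {
                out.put((byte) id);
            } else if (width == 2) {
                out.putShort((short) id);
            } else {
                out.putInt(id);
            }
        }
        return out.array();
    }

    // Widens each id back to int; masking treats the narrow value as unsigned.
    static int[] decode(byte[] payload) {
        ByteBuffer in = ByteBuffer.wrap(payload);
        int width = in.get();
        int[] ids = new int[in.getInt()];
        for (int i = 0; i < ids.length; i++) {
            if (width == 1) {
                ids[i] = in.get() & 0xFF;
            } else if (width == 2) {
                ids[i] = in.getShort() & 0xFFFF;
            } else {
                ids[i] = in.getInt();
            }
        }
        return ids;
    }
}
```

Under these assumptions, `NarrowIdCodec.encode(ids, 256)` spends one byte per id instead of four, and `NarrowIdCodec.decode` restores the original `int[]`. The widening happens in the reader's own array rather than in the bytes that travel over the wire, which is where the size reduction comes from.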