trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Track uncompressed data size for varchar and varbinary in Iceberg #15150

Open findepi opened 2 years ago

findepi commented 2 years ago

The Iceberg spec defines column_sizes as:

Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers.

(https://iceberg.apache.org/spec/)

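We use this field to fill io.trino.spi.statistics.ColumnStatistics#dataSize in the Iceberg connector, but dataSize should be the uncompressed data size, while column_sizes is the size on disk (that is, after compression and encoding).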
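
To illustrate the gap, here is a minimal Python sketch. It is not Trino or Iceberg code: zlib stands in for whatever codec the data files actually use, the helper names uncompressed_size and on_disk_size are made up for this example, and skipping nulls is an assumption of the sketch.

```python
import zlib


def uncompressed_size(values):
    """Sum of UTF-8 byte lengths of the non-null values (what dataSize is meant to reflect)."""
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def on_disk_size(values):
    """Hypothetical stand-in for column_sizes: byte count after compressing the column payload."""
    payload = b"".join(v.encode("utf-8") for v in values if v is not None)
    return len(zlib.compress(payload))


# A highly repetitive varchar column: 1000 copies of a 5-byte string plus one null.
values = ["trino"] * 1000 + [None]

raw = uncompressed_size(values)
disk = on_disk_size(values)
print(f"uncompressed={raw} bytes, on disk={disk} bytes")
```

For repetitive data like this, the on-disk number is far smaller than the uncompressed one, so filling dataSize from it would underestimate the size of varchar and varbinary columns.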

findepi commented 2 years ago

cc @danielcweeks @homar @ebyhr @alexjo2144

homar commented 1 year ago

@findepi I think the short-term part is done (https://github.com/trinodb/trino/pull/15186)