trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Add Iceberg dereference pushdown on the physical input size level #17156

Open findinpath opened 1 year ago

findinpath commented 1 year ago

As showcased in https://github.com/trinodb/trino/pull/17145, dereference pushdown does not yet work at the physical level for Iceberg. Even though the connector filters out the nested fields that are not needed, that data is still read from ORC/Parquet files (Avro has not been checked yet).

Add the necessary logic to prune the schema used when reading the columnar data files, so that only the relevant nested fields are read from the source file.
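
To illustrate the idea of pruning the read schema, here is a minimal sketch in Python. It is not Trino's or Iceberg's implementation: the dict-based schema representation, the dotted-path notation, and the `prune_schema` helper are all hypothetical, chosen only to show how a set of dereferenced paths could narrow a nested schema before it is handed to a columnar reader.

```python
import copy
from typing import Any, Dict, Iterable, List

# Hypothetical schema model: field name -> primitive type name (str)
# or a nested struct (dict of the same shape).
Schema = Dict[str, Any]


def prune_schema(schema: Schema, paths: Iterable[str]) -> Schema:
    """Return a copy of `schema` containing only the fields named by `paths`.

    Each path is a dotted dereference such as "address.geo.lat".
    A path that stops at a struct keeps the whole struct.
    """
    # Group the requested paths by their first component.
    wanted: Dict[str, List[str]] = {}
    for path in paths:
        head, _, rest = path.partition(".")
        wanted.setdefault(head, []).append(rest)

    unknown = set(wanted) - set(schema)
    if unknown:
        raise KeyError(f"unknown field(s): {sorted(unknown)}")

    pruned: Schema = {}
    for name, field in schema.items():  # preserve the original field order
        if name not in wanted:
            continue  # field is never referenced, so it is not read at all
        rests = wanted[name]
        if "" in rests:
            # The field itself was requested, so keep its full subtree.
            pruned[name] = copy.deepcopy(field)
        elif isinstance(field, dict):
            # Only some nested fields were requested: recurse into the struct.
            pruned[name] = prune_schema(field, rests)
        else:
            raise KeyError(f"cannot dereference primitive field {name!r}")
    return pruned


table_schema = {
    "id": "bigint",
    "address": {
        "city": "varchar",
        "zip": "varchar",
        "geo": {"lat": "double", "lon": "double"},
    },
}

# A query touching only `id` and `address.geo.lat` needs this read schema:
read_schema = prune_schema(table_schema, ["id", "address.geo.lat"])
print(read_schema)
```

With a read schema narrowed like this, a columnar reader would only need to open the column chunks or streams for `id` and `address.geo.lat`, which is the reduction in physical input size that this issue asks for.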

findepi commented 1 year ago

cc @raunaqmorarka (per https://github.com/trinodb/trino/issues/17201)

krvikash commented 1 year ago

It seems the same issue affects the Avro format as well.

findepi commented 1 year ago

Avro isn't columnar so we may not be able to improve Avro reads, but that's a non-goal for this issue. Let's have this issue focused on ORC and Parquet.