trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

https://trino.io

Apache License 2.0

Prune unreferenced struct fields in ORC reader for IO #17201

Open raunaqmorarka opened 1 year ago

raunaqmorarka commented 1 year ago

The ORC reader currently relies on lazy loading of blocks to avoid decoding unreferenced struct fields. However, it still includes every field of the struct when populating the structures used to plan reads from the ORC file. Because nearby small reads in the file are merged into larger reads, this can lead to over-reading from the file system. The Parquet reader avoids this by dropping all unreferenced struct fields when planning IO. The ORC reader can be improved to do the same.
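To make the over-read concrete, here is a minimal sketch of read coalescing, written in Python purely for illustration. It is not Trino's actual reader code: the `DiskRange` class, the `merge_ranges` helper, the stream layout, and the `MAX_GAP` value are all hypothetical. It assumes a struct `s` with four fields stored back to back, of which only `s.a` and `s.d` are referenced by the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskRange:
    """A byte range in the file (hypothetical stand-in for a planned read)."""
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length


def merge_ranges(ranges, max_gap):
    """Coalesce ranges whose gap to the previous range is at most max_gap bytes."""
    merged = []
    for r in sorted(ranges, key=lambda x: x.offset):
        if merged and r.offset - merged[-1].end <= max_gap:
            last = merged[-1]
            new_end = max(last.end, r.end)
            merged[-1] = DiskRange(last.offset, new_end - last.offset)
        else:
            merged.append(r)
    return merged


def bytes_read(ranges, max_gap):
    """Total bytes fetched after coalescing the planned ranges."""
    return sum(r.length for r in merge_ranges(ranges, max_gap))


# Hypothetical layout: streams of struct fields s.a .. s.d stored contiguously.
streams = {
    "s.a": DiskRange(0, 100),
    "s.b": DiskRange(100, 400),
    "s.c": DiskRange(500, 400),
    "s.d": DiskRange(900, 100),
}
referenced = {"s.a", "s.d"}  # fields the query actually uses
MAX_GAP = 512                # hypothetical merge threshold in bytes

# Planning with every field: all four ranges coalesce into one large read.
all_fields = bytes_read(streams.values(), MAX_GAP)

# Planning with only referenced fields: the 800-byte gap exceeds MAX_GAP,
# so the two small reads stay separate and s.b / s.c are never fetched.
pruned = bytes_read(
    [r for name, r in streams.items() if name in referenced], MAX_GAP
)

print(all_fields, pruned)
```

In this toy layout, planning with all fields fetches the whole 1000-byte span, while planning with only the referenced fields fetches 200 bytes. Lazy loading alone does not help in the first case, because the unreferenced streams have already been pulled in as part of the merged read.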

fyi @findepi @findinpath @dain

findepi commented 1 year ago

Is this the same as or related to @findinpath's https://github.com/trinodb/trino/issues/17156?

raunaqmorarka commented 1 year ago

Is this the same as or related to @findinpath's #17156?

I'm assuming that #17156 is Iceberg-specific, where the Iceberg logic needs to be fixed for Parquet (Hive + Parquet works as expected). This issue refers to an ORC reader problem that affects both the Hive and Iceberg connectors. We could close this one if it's less confusing to track both problems in #17156.