trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.48k stars 3.01k forks source link

[Proposal] [Hive connector] Use column statistics to prune partitions #11941

Closed damnMeddlingKid closed 2 years ago

damnMeddlingKid commented 2 years ago

I'd like to propose a feature to allow utilizing per partition column statistics (like min/max) to filter out partitions that don't overlap with the predicate. If partitions don't contain statistics they would not be filtered out.

This behaviour can be controlled by a session variable or table property so that it is only enabled for tables where the user knows that the statistics are fresh and accurate.

I believe it would be beneficial in ETL pipelines where partitions are not updated after they are inserted.

raunaqmorarka commented 2 years ago

I believe similar ideas about using statistics outside of the CBO have been discussed in the past. We want to avoid using stats for query results or pruning or anything outside CBO because it carries the risk of wrong results. We also want to avoid adding features behind toggles that are not generally safe to use. You can look at using iceberg or delta instead where we can safely use file level min/max statistics to prune splits.

damnMeddlingKid commented 2 years ago

Thanks for the clarification!.