trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.35k stars 2.98k forks source link

JdbcMetadata#applyFilter should not apply filter when limit is already pushed down #5535

Open sopel39 opened 4 years ago

sopel39 commented 4 years ago

Pushing predicate through limit might change query semantics (e.g when data is ordered). We don't push predicate through limit in Presto.

CC: @findepi

findepi commented 4 years ago

See https://github.com/prestosql/presto/pull/3691#discussion_r423344082 and especially https://github.com/prestosql/presto/pull/3691#discussion_r424299090

sopel39 commented 4 years ago

Applying limit is not even beneficial from performance point of view. We most likely can finish quicker without filtering rows than with extra filtering.

findepi commented 4 years ago

That's correct, but it may happen that applyFilter is simply invoked before LIMIT makes it into proximity of the TS. In such a case, the filter has already been applied and there is (currently) no way back. Then the only choice is whether you also apply the limit or not.

findepi commented 4 years ago

and especially #3691 (comment)

for ref, here is the comment's content:

I realized we had this discussion some time ago with JDBC connectors, but cannot find it now. @martint @electrum ? Unfortunately we didn't capture much of the thinking in the code, and it looks as if this was not addressed.

sopel39 commented 4 years ago

That's correct, but it may happen that applyFilter is simply invoked before LIMIT makes it into proximity of the TS.

That would be pushdown of operators in same order as they are ordered in Presto. This doesn't change query semantics.

Then the only choice is whether you also apply the limit or not.

If not applying filter would improve performance, I think we should not apply it.

Praveen2112 commented 3 years ago

What if we could maintain the order and generate the Query based on the order For a query plan like

Filter ->Limit-> TableScan

SELECT col.. FROM (SELECT * FROM tbl LIMIT n) WHERE <filter clause> 

Limit ->Filter -> TableScan

SELECT col.. FROM (SELECT * FROM tbl WHERE <filter clause>) LIMIT n
sopel39 commented 3 years ago

Another though to this issue. Pushdown of FILTER after LIMIT might make sense if there is an aggregation e.g:

Filter -> Limit -> Aggregation -> TableScan

if filter is on aggregation keys, then less groups might need to be evaluated.