prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

Named queries (WITH) should be evaluated only once #8758

Open martint opened 7 years ago

martint commented 7 years ago

Currently, Presto evaluates named queries (WITH clause) by inlining them wherever they are referenced in the body of the main query. This can cause problems if the query is non-deterministic, since every inlined instance might produce different results.

According to the spec:

1) If a non-recursive <with clause> is specified, then:
    ...
    b) For every <with list element> WLE, let WQN be the <query name> immediately contained in WLE. Let WQE be 
       the <query expression> simply contained in WLE. Let WLT be the table resulting from evaluation of WQE, with 
       each column name replaced by the corresponding element of the <with column list>, if any, immediately contained in WLE.
    c) Every <table reference> contained in <query expression> that specifies WQN identifies WLT.

This says that each named query should be evaluated once and the result "stored" in a table WLT. Thereafter, any references to the named query directly identify that WLT table.

martint commented 7 years ago

WLT in the definition above is what the spec calls "transient table":

4.15.4 Transient tables A transient table is a named table that may come into existence implicitly during the evaluation of a <query expression> or the execution of a trigger. A transient table is identified by a <query name> if it arises during the evaluation of a <query expression>, or by a <transition table name> if it arises during the execution of a trigger. Such tables exist only for the duration of the executing SQL-statement containing the <query expression> or for the duration of the executing trigger.

sirlori commented 3 years ago

@martint Hello!

Is this closed because the issue was resolved? If so, does also this one need closing?

martint commented 3 years ago

@sirlori, I'm no longer working on the prestodb repository, so this issue here is no longer relevant to me.

jaystarshot commented 3 months ago

We have CTE materialization support now for the Hive connector and this behavior should be attainable by using the all strategy.
Else, we could add a strategy to detect non-deterministic ctes and automaterialize them, that can be a performance penalty depending upon the use cases. Cte materialization documentation