trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.42k stars 3k forks source link

Reuse sub-expressions #1070

Open findepi opened 5 years ago

findepi commented 5 years ago

In case of a query that computes a deterministic sub-expression multiple times, we should compute it only once.

For example, the following query could compute split(x, '_')[2] once and reuse.

presto:tiny> EXPLAIN SELECT
          ->     'One ' || split(x, '_')[2] AS a,
          ->     'Two ' || split(x, '_')[2] AS b
          -> FROM (VALUES 'XXX_1234_YYYY') t(x);
                                                Query Plan
----------------------------------------------------------------------------------------------------------
 Output[a, b]
 │   Layout: [concat:varchar, concat_2:varchar]
 │   Estimates: {rows: 1 (110B), cpu: 165, memory: 0B, network: 0B}
 │   a := concat
 │   b := concat_2
 └─ Project[]
    │   Layout: [concat:varchar, concat_2:varchar]
    │   Estimates: {rows: 1 (110B), cpu: 165, memory: 0B, network: 0B}
    │   concat := "concat"(CAST('One ' AS varchar), CAST("split"("field", '_')[BIGINT '2'] AS varchar))
    │   concat_2 := "concat"(CAST('Two ' AS varchar), CAST("split"("field", '_')[BIGINT '2'] AS varchar))
    └─ LocalExchange[ROUND_ROBIN] ()
       │   Layout: [field:varchar(13)]
       │   Estimates: {rows: 1 (55B), cpu: 55, memory: 0B, network: 0B}
       └─ Values
              Layout: [field:varchar(13)]
              Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
              ('XXX_1234_YYYY')

(1 row)
sopel39 commented 5 years ago

The problem here could be estimating when it's worth to do subexpression extraction (how complex it should be to consider extraction). On one hand, CPU could optimize expression evaluation. On the other hand, if you added another projection for that subexpression then the overhead might not be worth it.

Perhaps some approach when common subexpression is evaluated once within generated bytecode for two expressions is feasible and would not introduce any overhead.

wilhelmjung commented 2 years ago

Hi @findepi I am just wondering. Is this optimization related to common sub-expression elimination? As it has been implemented in Presto 0.245 version. Thanks.