trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.54k stars 3.03k forks source link

Optimize Presto self-join queries #5877

Open sopel39 opened 4 years ago

sopel39 commented 4 years ago

Case study tpcds/q95:

WITH
  ws_wh AS (
   SELECT
     "ws1"."ws_order_number"
   FROM
     ${database}.${schema}.web_sales ws1
   , ${database}.${schema}.web_sales ws2
   WHERE ("ws1"."ws_order_number" = "ws2"."ws_order_number")
      AND ("ws1"."ws_warehouse_sk" <> "ws2"."ws_warehouse_sk")
) 

Web sales tables are large and are currently read twice by Presto. However, since this is a self join Presto could wait until build side is read and feed it back as a probe side. This would save one read of web_sales table. Such optimization requires that join is executed as repartitioned join (which for large tables would be the case).

It seems that we could try to optimize self joins both before and after reorder joins rule:

sopel39 commented 4 years ago

In case of queries like:

                   InnerJoin[x1.a = x2.a]
                  /                    \
             TS[x1]                     InnerJoin[x2.a = z.a]
                                          /              \
                                      TS[x2]             TS[z]

it can still be simplified to:

                        SelfJoin[x.a = x.a]
                                 |
                        InnerJoin[x.a = z.a]
                          /              \
                     TS[x]              TS[z]

That can be simplified when either: