prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

Optimize coalesce of join keys #23603

Open kaikalur opened 2 months ago

kaikalur commented 2 months ago

We have seen queries like:

COALESCE(l.id, r.id) FROM l INNER/LEFT JOIN r ON l.id=r.id

Here for both inner ad left joins we can simplify it to l.id since it's a join key. Similarly for the converse for right join. We should apply this rule before join reordering

aaneja commented 1 month ago

This already works for INNER JOINs -

presto:tiny> explain select COALESCE(l.nationkey, r.regionkey) FROM nation l inner join region r on l.nationkey=r.regionkey;
                                                                                                                                                                                    Query Plan                                                                                          >
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 - Output[PlanNodeId 9][_col0] => [nationkey:bigint]                                                                                                                                                                                                                                    >
         Estimates: {source: CostBasedSourceInfo, rows: 5 (45B), cpu: 630.00, memory: 45.00, network: 90.00}                                                                                                                                                                            >
         _col0 := nationkey (1:16)                                                                                                                                                                                                                                                      >
     - RemoteStreamingExchange[PlanNodeId 308][GATHER] => [nationkey:bigint]                                                                                                                                                                                                            >
             Estimates: {source: CostBasedSourceInfo, rows: 5 (45B), cpu: 630.00, memory: 45.00, network: 90.00}                                                                                                                                                                        >
         - InnerJoin[PlanNodeId 286][("nationkey" = "regionkey_0")] => [nationkey:bigint]                                                                                                                                                                                               >
                 Estimates: {source: CostBasedSourceInfo, rows: 5 (45B), cpu: 630.00, memory: 45.00, network: 45.00}                                                                                                                                                                    >
                 Distribution: REPLICATED                                                                                                                                                                                                                                               >
             - TableScan[PlanNodeId 0][TableHandle {connectorId='tpch', connectorHandle='nation:sf0.01', layout='Optional[nation:sf0.01]'}] => [nationkey:bigint]                                                                                                                       >
                     Estimates: {source: CostBasedSourceInfo, rows: 25 (225B), cpu: 225.00, memory: 0.00, network: 0.00}                                                                                                                                                                >
                     nationkey := tpch:nationkey (1:56)                                                                                                                                                                                                                                 >
             - LocalExchange[PlanNodeId 334][HASH] (regionkey_0) => [regionkey_0:bigint]                                                                                                                                                                                                >
                     Estimates: {source: CostBasedSourceInfo, rows: 5 (45B), cpu: 90.00, memory: 0.00, network: 45.00}                                                                                                                                                                  >
                 - RemoteStreamingExchange[PlanNodeId 307][REPLICATE] => [regionkey_0:bigint]                                                                                                                                                                                           >
                         Estimates: {source: CostBasedSourceInfo, rows: 5 (45B), cpu: 45.00, memory: 0.00, network: 45.00}                                                                                                                                                              >
                     - TableScan[PlanNodeId 1][TableHandle {connectorId='tpch', connectorHandle='region:sf0.01', layout='Optional[region:sf0.01]'}] => [regionkey_0:bigint]                                                                                                             >
                             Estimates: {source: CostBasedSourceInfo, rows: 5 (45B), cpu: 45.00, memory: 0.00, network: 0.00}                                                                                                                                                           >
                             regionkey_0 := tpch:regionkey (1:76) 

For outer joins, we would need to set the joins_not_null_inference_strategy to a value other than NONE, so that we add NOT NULL filters on the outer side of the join. Once this is set, we could optimize out the COALESCE