For use cases like a JOIN after a UNION ALL, it can be beneficial to have the UNION ALL run in a single stage.
Existing behavior
For example, consider this simplified fragment of the TPCDS Q5 query -
select 1
from (
select cs_catalog_page_sk as page_sk,
cs_sold_date_sk as date_sk
from catalog_sales
union all
select cr_catalog_page_sk as page_sk,
cr_returned_date_sk as date_sk
from catalog_returns
) salesreturns,
date_dim,
catalog_page
where date_sk = d_date_sk
and d_date between cast('2001-08-04' as date)
and date_add('day', 14, cast('2001-08-04' as date))
and page_sk = cp_catalog_page_sk
where we see that we have large data transfers from Fragment 1 & 2 to perform the UNION, only to have the very selective JOIN execute next and discard most rows
If instead, we could execute the table scans of catalog_sales and catalog_returns in a single stage, we would save the network cost of data transfer
Proposed behavior
Trino added a MultiSourcePartitionedScheduler in https://github.com/trinodb/trino/pull/17265, that allows the scheduler to run multiple source partitioned table scans in a single stage. This along with changes to how exchanges are added, improved the performance of TPCDS Q02 and Q05 significantly
The Trino distributed plan for the above query is -
However I am not sure if this performs the same at scale since the same nodes will do the work of scanning. Also not sure how soft affinity etc will be handled
For use cases like a JOIN after a UNION ALL, it can be beneficial to have the UNION ALL run in a single stage.
Existing behavior
For example, consider this simplified fragment of the TPCDS Q5 query -
The distributed Presto plan for this is -
where we see that we have large data transfers from Fragment 1 & 2 to perform the UNION, only to have the very selective JOIN execute next and discard most rows
If instead, we could execute the table scans of
catalog_sales
andcatalog_returns
in a single stage, we would save the network cost of data transferProposed behavior
Trino added a
MultiSourcePartitionedScheduler
in https://github.com/trinodb/trino/pull/17265, that allows the scheduler to run multiple source partitioned table scans in a single stage. This along with changes to how exchanges are added, improved the performance of TPCDS Q02 and Q05 significantlyThe Trino distributed plan for the above query is -
We can experiment with this to see if this is feasible to implement in Presto as well