This Github issue describes the design of the PrestoDB Query Execution Optimization for Broadcast Join by Replicated-Reads Strategy. The original design document can be found here. For reader's convenience, the content is summarized in this issue below.

1. Background

One of the most common best practices to build a parallel data warehousing is to design an optimal multidimensional schema e.g. star schema, or snowflake schema, in which a central table is known as the fact table and the other tables are known as dimension tables. Query comes with WHERE conditions on columns of the dimensions tables, grouping on columns of the dimension tables and aggregations based on columns of the fact tables.

In the MPP distributed system, there are multiple forms of distributed join methods, particularly we focus on join distribution methods supported in Presto: Collocated join, directed join, repartitioned join, broadcast join.

In this issue we focus on broadcast join. For broadcast join, all the records of T2 table happen on a single worker which is a single network data transfer. Then the data gets redistributed to all the other available N-1 worker nodes which hold the other table before the join is performed.Broadcast join can only happen on the builder side when the table size is smaller than the join_max_broadcast_table_size. Otherwise, the optimizer will choose to join by repartitioning both sides of join tables.

Note that this is a total of N+1 data transfers. Moreover, only the results of the table scan are potentially cached. Furthermore, the worker that is selected to do the caching is not necessarily chosen with any cache affinity.

In typical cases, dimension tables store supporting information to the fact table in a star schema database structure, or a snowflake schema where multiple dimension tables are involved in a single query. Running in low latency for many OLAP queries is very crucial for business success

In today’s implementation, presto scheduler instantiates a single stage for a worker to run data extraction, and then redistribute the entire dataset to all workers on a separate stage which builds a hash table in a set of tasks. This was not an ideal case for resource management. Scheduling tasks for a fact table joining multiple dimension tables in a snowflake schema using this optimization may result in better resource utilization because each plan fragment or stage can now manage the same workers to run tasks in parallel within a single stage to reduce data shuffling.

2. Proposal

In today’s Presto, collocated join is allowed only as a grouped execution when joining two bucketed tables for which the bucket properties are compatible. We propose to add a new collocated join strategy for broadcast join.

2.1 Optimization scenarios:

This feature introduces a new query execution strategy by enabling a collocated broadcast join where source data from small tables is pulled from remote cloud storage directly by all workers instead of having one worker being read table once and then send the entire data over the network to all other workers. This operation of allowing all workers directly pulling data from source storage is called replicated reads. Allowing replicated reads from dimension tables distributed across workers for parallel execution may speed up the query runtime for faster response time. Furthermore, given that all worker nodes N are now participating in scanning the data from remote source, the table scan is cacheable per all worker nodes regardless of node selection affinity for caching splits.

This optimization is feasible for data lake analytics on the cloud native environment, where the raw data is stored on a high bandwidth storage media supporting many parallel connections and capable of handling high concurrency of connection requests for remote data extraction, e.g AWS S3, Azure Blob Storage, Google Cloud BigStorage. E.g., Amazon S3 is a good use for this kind of optimization because it uses a scalable storage infrastructure with a top limit of 5500 GET/HEAD requests per second in an S3 bucket. S3 request workload is designed to scale up well under control before it gets maxed out.

This optimization will NOT fit for a use case where the data storage media has a limitation on handling multiple connections. That is, any remote data source that could only accept requests from multithreaded clients from a managed connection pool to instantiate multiple connections via JDBC/ODBC may not fit because there is a risk of overloading the server when scaling up the connection requests. E.g., MySQL has a default maximum permitted number of simultaneous client connection set to 151 which means a higher number of connections to fetch data from MySQL database may either results in not closing MySQL connection handlers properly or “too many connections” error, or server not responding due to sudden load spikes. This optimization is not ideal for remote storage that can only operate under a limited number of parallel connection threading given that most data pull tasks may be queued up that could increase the queue wait time as well as overall latency.

2.2 Impact to the workload:

This optimization reduces the data broadcast, relatively reducing network cost of data shuffle by trade-in with a higher number of connection concurrency to scan data by scheduler workers in parallel. The Replicated Join strategy now includes both "broadcast join" and "replicated read join" where broadcast join requires the table being replicated by a shuffle, vs. replicated join requires the table being replicated by reading it multiple times. Furthermore, it optimizes the scheduling policy to allow multiple dim data sources in a single stage for lower query latency.

In a later session, the benchmark test using Amazon S3 as a remote data source in cloud lake validates that to be able to do parallel reads to open hundreds of connections on the data source is more efficient than using a single connection and broadcasting the data across the cluster. It improves query performance up to 12% - 57% using extensive caching mechanisms.

3. Externals

3.1 Tuning knobs:

This optimization introduces following properties:

System Session Property: enable_colocated_join_for_dim_replicate_table

This property is a system level knob that allows the optimizer to properly structure the physical plan for qualified replicated reads tables to use replicated reads.

Default is FALSE to disable replicated reads optimization. As mentioned above, this optimization profiles for all broadcast joins. This is a non-hidden user facing property.

Hive Catalog Property: hive.overwrite-high-bandwidth-storage-for-replicated-reads
- This is defined as the hive connector’s catalog property. This knob allows the hive connector to decide if the table supports replicated reads.
- When the knob is TRUE, tables that store in S3 may be eligible for replicated reads and an indication flag is marked in the table handle to return to the optimizer for further process.
- When the knob is FALSE, none of the tables via hive catalog may be eligible for replicated reads.
- Default is FALSE, given that this optimization is dedicated to work with large scale cloud storage media. Users may turn on this knob after the type of storage media is ensured. This field is currently non-hidden and will be hidden in production.

3.2 Plan comparison:

Before:
After:

3.3 Other recommendations

This is a Cost Based Optimization to enhance execution strategy for a broadcast join using new replicated join. Users are recommended to collect table statistics to make a better join distribution method for cases where the query consists of more than one table. Users may experience plan differences in case of plan validation. It is noticeable that all builder sides of tables from broadcast joins are now in the same plan fragment as the probe side whereas before they were in separate fragments.
There are no changes in SQL Syntax.

4. Design

4.1 Design Overview

The essence of this feature is to reduce data shuffle cost to improve overall query runtime. In today’s presto, join by data replication today broadcasts the builder side of the join for which scheduler assigns splits to tasks on a single worker and then tasks in intermediate stages redistributes data from upstream tasks to other workers.

As shown below case 1, the scheduler sends splits of unpartitioned data by a single driver of worker 5 which redistributes the T2 data to other workers. So to optimize the network cost using replicated-reads join method, scheduler for Case 1 now sends tasks of splits to table T2 by multiple drivers in parallel by all 5 workers to make a collocated join. Same for Case 2 as T2 builds hash tables without requiring an extra stage of data shuffle which creates data redundancy and unnecessary network cost.

4.2 Planning Optimization / ConnectorMetadata:

This feature is specific to a cloud storage environment capable of handling high concurrent reads, hence, the logic to define a table property as a “replicated reads table” (RRT) is implemented at the level of connector metadata. That is, it is decided per individual connector’s characteristics and storage media type if the RRT can be adopted for performance gain. For current implementation, the focus is on data stored in Amazon S3 using Hive. The support for data access on Azure Blob, Google Cloud BigStorage and many other cloud storages will be added later in extension to this feature.
During the query planning phase, the optimizer creates the table's layout from the catalog. Given it is a Hive connector based, a new logic is added in HiveMetadata to determine if the table being processed is eligible to mark as a RRT table. To determine if a dim table is satisfied as a RRT table, the following conditions are required:

hive.overwrite-high-bandwidth-storage-for-replicated-reads is enabled.

Table is stored in S3 FS or Caching FS that extends S3 FS

The Interfaces to set cloud table property is through following modules via ConnectorTableLayout object:
- ConnectorPlanOptimizer: HiveFilterPushdown
- Planner: PickTableLayout
To define a dimension table that is eligible to use replicated reads join, there are 2 decision layouts to follow.
- The 1st step is decided by the catalog. Metastore contains file system type indicating where the file of the table is located. Current design phase only supports tables stored on AWS S3, meaning the file system type can only be PrestoS3FileSystem or CachingPrestoS3FileSystem. If the file format satisfies, the metadata returns a table layout with a unique table handle indication that the table is a cloud table which means it will be eligible for replicated reads optimization. If the file format does not satisfy the S3 type, metadata returns a table layout as a regular table without any indication in the table handle.
- 2nd step is decided by the optimizer which can be controlled by the system session property enable_colocated_join_for_dim_replicate_table

4.3 Optimizer

Optimizer decides to use broadcast join based upon table size being smaller than the threshold bounded by join-max-broadcast-table-size which by default is 100MB. In most cloud production cluster configuration, max-split-size is set to 250MB. Hence, the total number of splits for each dimension table is only portioned to the number of workers allocated in the cluster. E.g. If the cluster contains 100 workers, the scheduler starts tasks to send 1 split per worker per dimension table file which submits 100 GET requests to cloud storage concurrently, assuming there is only 1 file per dimension table.
Replicated reads table does not need to be configured with partition information, but it must have a definition in ConnectorPartitioningHandle for data co-location with other tables. Today, Presto schedules a worker to scan from the source using SystemPartitioningHandle to read the dimension data followed by an exchange operation to broadcast the entire data to all available workers so that the downstream consumer is in a separate stage. This is optimized to use a Hive compatible partitioning handle for RRT table to extract rows from hive storage. This is very similar to virtual bucketing for hive bucketed tables, however, virtual bucketing requires a matching bucket number for joins in order to have addressable splits for recoverable grouped execution. This is not feasible for RRT table. Hence, RRT table is now using the property of the HivePartitioningHandle which contains several files as the bucket count, a list of hive column types from the dim table, a bucket function type in HIVE_COMPATIBLE, and an indicator of if this partitioning handle is used for dim replicated table. This partitioning handle is used when physical planning decides how to fragment the logical plan into stages. If the partitioning schemes from both sides of join, both join legs stay in the same stage.
Planner rule AddExchange is changed in Optimizer to avoid planning Remote Source from ExchangeClient when replicated reads are allowed and cloud tables are determined to be eligible.

PlanFragmenter decides if it needs to use an alternative table handle if the stage involves a replicated read join and source distribution is encountered. Similar concept as grouped execution for bucketed co-located join, replicate read table creates a virtual bucketed property for which its bucket count is the same as the number of stored files. This is useful for a split loader to load the split per file. In the below example, the table on the right has a single file which is read and kept in cache on all workers so the fact table on the left can collocate join. So, within the same stage, the scheduler assigns a task with the same number of splits on the right table to each worker. So that the hash table built from the right table’s bucket is stored in the memory and streams the probe side of the table's data in the same stage as shown in the figure below.

For cloud tables with replicated reads, we will apply node affinity policy to available nodes for replicate reads across all workers.

4.4 Scheduler (FixedSourcePartitionedScheduler)

4.4.1 Scheduling flow

The coordinator distributes the physical plan to workers, starts execution of tasks and begins to enumerate splits as a bunch of addressable chunks of data in an external storage system. Scheduler needs to understand the property of the split whether or not the splits are required to send to all workers. This requires a flag added in splitSource interface to indicate if the split is qualified for replicated reads. Depending on this flag information, FixedSourcePartitionedScheduler will create a new ReplicatedReadsSplitPlacementPolicy dedicated to handle split assignment maps to all available bucket nodes. This allows each task to select nodes for split distribution correctly.
When using the extensive caching optimization powered by RaptorX, replicated reads tables can be stored and persistent in local SSD in compliance to scan-resistant caching policy. This is achieved by passing the table’s replicate reads property from table handle to ConnectorBucketNodeMap. If it’s a cloud table, we enable caching property in FixedBucketNodeMap to indicate the split is cacheable which is used to determine if the current file is eligible for local cache.

4.4.2 Schedule splits

Schedule splits for RRT table

First scheduler needs to understand the property of the split whether or not the splits are required to send to all workers. So it queries the connector and figures out how many splits to process, during which we mark each split with a notation that if the split is for a cloud table and may use replicated reads. This flag is added in splitSource interface.
Node selection strategy for connector splits is following: The node affinity of a split for
- cloud table: hard_affinity: Split is NOT remotely accessible and has to be on specific nodes
- probe table: soft_affinity: Connector split provides a list of preferred nodes for the executors to pick from but not mandatory.
Scheduler picks up a list of preferred nodes, and based on priority from the provided nodes (if queued or busy) scheduler will pick the nodes if the provided nodes are not busy. We define in split source(lazySplitSource) with a flag to mark if table can be replicated among all workers NodeScheduler define a new SplitPlacementPolicy called ReplicatedReadsSplitPlacementPolicy along with a SplitPlacementResult method to handle cloud replicated table for selecting nodes to distribution data for broadcast join. This new policy is dedicated for replicated reads to assign each split of file splits to all workers.

5. Performance

In this session, we present performance results that demonstrate the impact of this optimization.

5.1 Environment

r5.8xlarge configuration: (32vCPU | 256Gb RAM | 10Gbps | 6800Mbps EBS) Presto coordinator: r5.8xlarge x 1 instance Presto worker: r5.8xlarge x 16 instances Hive Metastore: m5.xlarge IO Cache(if present): gp2 SSD (384GB | IOPS 1152 Mbps) powered by RaptorX Workload: TPCDS sf10000 no partition on S3

5.2 Microbenchmark

5.2.1 Overall Performance

In Cold runs, there is no significant difference for both runs with cache and without cache in cold runs, mostly performance is achieved up to par.

In Warm runs, the performance gain is between 10% to 31% for no cache.

In Warm runs, the performance gain is between 48% to 57% for cache enabled on top of RaptorX

5.3 Sample 10TB TPC-DS runs

5.3.1 Overall Performance

As shown above, without caching, the performance is up to par, and some queries are performing better in cold runs.
When running with RaptorX, we could get the performance gain up to 18%.
One of reasons for small performance gain is because of the fact that in the past, the scheduler builds a topology graph and assigns tasks for that stage to worker nodes. For shared-nothing deployment, the scheduler uses the connector data layout API to decide task placement for leaf stages. And when a task in a leaf stage begins execution on a worker node, the node makes itself available to receive one or more splits. With replicated reads enabled, all these leaf stages that were meant for split scheduling are now merged as a single leaf stage for multiple split assignments. Workers maintain a queue of splits they are assigned to process without having a good order of priority if multiple broadcast joins are bundled in that leaf stage. An optimization in the scheduler is needed to build a topological split per each worker.

6. Operation Considerations

Given that there will be potentially much more splits to schedule, the best practice to tune the scheduler is to configure with the following properties in place:

presto-catalog:

hive.split-loader-concurrency=64 <-- this may need to be changed dynamically

presto-coordinator-etc:

task.concurrency=32 query.min-schedule-split-batch-size=4000 node-scheduler.max-splits-per-node=4000 node-scheduler.max-pending-splits-per-task=4000

Need to consider a change of execution_policy='all-at-once' given that it will minimize wall clock time and schedule all stages of execution concurrently. This change is recommended since we are reducing the number of execution stages to allow directed data flow and optimized execution in topological order.

7. Future Work

Add support to build topological split alignment. When joins of multiple dimension tables are in same stage, all tasks for replicated tables are assigned to execute the splits without prioritizing the build hash table sequence
More robust scan-resistance cache policy
Optimized Execution Policy: allAtOnce vs. Phased

8. Implementation

The working in progress PR will be listed here. Note that it's not ready for review yet

prestodb / presto

Query Execution Optimization for Broadcast Join by Replicated-Reads Strategy #17619