trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.27k stars 2.96k forks source link

Skew Join Optimizer in Presto #1261

Open rupamk opened 5 years ago

rupamk commented 5 years ago

Consider following query.

SELECT * from B JOIN A WHERE B.x = A.x

Consider a case, where the Table B data for column x is unevenly distributed and skewed for a specific key. A skewed distribution of keys manifests itself in skewed distribution of workload. As a result, some of the nodes joining the most popular keys will perform heavy computation compared to others. Handling the skewed key becomes a bottleneck for the entire JOIN stage which effectively delays the overall execution of the Hash Join operation.

Problem: How can we evenly distribute the workload for a JOIN (INNER) operation with skewed keys ? ”

The above query can be rewritten to redistribute the workload as:

1.

(SELECT *, RAND(n) as newColumnLeft from B) as T1
join
(
select *, 0 as newColumnRight from A
UNION
select *, 1 as newColumnRight from A
UNION
.
.
select *, n-1 as newColumnRight from A
) as T2
WHERE T1.x = T2.x AND T1.newColumnLeft = T2.newColumnRight

Drawback: Lots of replication for the non-skewed keys.

  1. Information about the skewed keys can be used to minimize the replication. assume for now, the keys (1, 3) from column x belonging to table B are skewed and the replication factor is assumed to be 3. The query can be rewritten to redistribute the most popular key further as follows:
SELECT * from
(SELECT *, RAND(3) as newColumnLeft from B where B.x in (1,3)
UNION
SELECT *, 0 as newColumnLeft from B where B.x NOT in (1,3)
) as T1 
JOIN
(SELECT *, 0 as newColumnRight from A where A.x in (1, 3)
UNION
SELECT *, 1 as newColumnRight from A.x in (1, 3) 
UNION
SELECT *,2 as newColumnRight from A.x in (1, 3)
UNION
SELECT *, 0 as newColumnRight from A where A.x NOT in (1, 3)) as T2
WHERE 
T1.x = T2.x And T1.newColumnLeft = T2.newColumnRight

For evaluation, two tables are created (Table definition in Appendix 2, https://docs.google.com/document/d/1gmzz8TsuVl9W2VK5sdorUw_sN1W1gWzw_oL4QKgAETM/edit?usp=sharing). Bigger Table has two fields, one of which is skewed on one key, the skewfactor is varied as shown below. The configuration for the cluster is as follows:

Cluster Configs: Master Node Type: m3.2xlarge - 8 cores, 30GiB mem Worker Node Type: m3.2xlarge - 8 cores, 30GiB mem Min/Max Workers: 5 (On Demand Nodes)

Bigger Table: # of Rows Smaller Table: # of Rows Skew % of the most popular key Without Optimization With Optimization With Broadcast Join
11000000 1000000 10% 44 minutes 20 minutes 50 minutes
11200000 1000000 12% 67 minutes 27 minutes 63 minutes
11500000 1000000 15% 87 minutes 32 minutes 100 minutes

Drawback: Appending the new column, there are multiple TableScans for the same table which can introduce latency.

  1. To avoid the multiple table scans, a solution is proposed here: https://docs.google.com/document/d/1gmzz8TsuVl9W2VK5sdorUw_sN1W1gWzw_oL4QKgAETM/edit?usp=sharing (Work in Progress)

Highlights from the doc: a. Hint framework to get inputs from user (can be extended beyond Skew Join Implementation) b. Explode Operator (used here for replicating the smaller table, can be extended beyond Skew Join Implementation)

sopel39 commented 5 years ago

Interesting idea.

Explode Operator

There is similar to GroupIdOperator operator in Presto. It is used to generate input for grouping set aggregation. We also multiplicate pages there (for each grouping set).

a. Hint framework to get inputs from user (can be extended beyond Skew Join Implementation)

I'm not sure that would be the way to go. We rather try to avoid user hinting in Presto because it creates legacy syntax which is hard to remove later on. It would be great if CBO could help in such case automatically.

Would it be possible if for your benchmarks you also attached results with BROADCAST join enabled?

rupamk commented 5 years ago

Would it be possible if for your benchmarks you also attached results with BROADCAST join enabled? Yes I will update the Doc.

rupamk commented 5 years ago

I'm not sure that would be the way to go. We rather try to avoid user hinting in Presto because it creates legacy syntax which is hard to remove later on. It would be great if CBO could help in such case automatically.

One more question, how is CBO going to help here, since we want explicitly the skewed keys?

sopel39 commented 5 years ago

One more question, how is CBO going to help here, since we want explicitly the skewed keys?

We could improve CBO so that it explicitly keeps track of outlier (skewed) values.

rongrong commented 5 years ago

If we could know which keys are skewed at planning time broadcast join the skewed keys union partition join the rest is the best solution. It's very hard to know skewed keys though.

rupamk commented 5 years ago

We could improve CBO so that it explicitly keeps track of outlier (skewed) values.

We can plug in CBO too but the execution part detailed in this proposal would still be needed. So I plan to start with this execution work, the better ways to provide input to it via CBO could follow

rupamk commented 5 years ago

Would it be possible if for your benchmarks you also attached results with BROADCAST join enabled?

updated the results in the doc as well as here.

sopel39 commented 5 years ago

We can plug in CBO too but the execution part detailed in this proposal would still be needed. So I plan to start with this execution work,

We don't want this code to be dead by any means. @martint @findepi @kokosing Do you have some other ideas (apart from hints and improving CBO) for passing outlier values to this optimization?

Unfortunately we cannot dynamically detect outliers at this point.

kokosing commented 5 years ago

We don't want this code to be dead by any means. @martint @findepi @kokosing Do you have some other ideas (apart from hints and improving CBO) for passing outlier values to this optimization?

Maybe we could model this a some session property for now? Like:

SET SESSION skewed_columns = ['catalogA.schemaA.tableA.columnA', 'catalogA.schemaA.tableB.columnB'];

So effectively this is a hint, however then in future we could then try to automatically detect such columns and then remove this flag or leave it. Notice that, it does not require us to introduce SQL syntax "enhancements" to support hints.

sopel39 commented 5 years ago

Alternatively we could store skew info as column property

findepi commented 5 years ago

@rupamk i think Hive/HMS has some means to "annotate" skewed columns. We could piggy back on this.

martint commented 5 years ago

Maybe we could model this a some session property for now? Like:

Yes, that's a good way to approach the problem. We do something similar for distributed geospatial joins and grouped execution. We can have the functionality in with a way for users to benefit from it "manually" and then over time make Presto smarter (by improving the optimizer, etc), to do it automatically.

kokosing commented 5 years ago

Alternatively we could store skew info as column property

Not every connector support column properties, but all connectors support skewness ;)

grantatspothero commented 4 years ago

Any update on skew join support to the CBO + the hive connector? Would be really helpful for scenarios where you are joining on a column with lots of nulls.

A "manual" approach with a session property (or something) would also be appreciated since having the CBO optimizer support will probably take much longer to implement.

rupamk commented 4 years ago

@stagraqubole @vijaymann

damnMeddlingKid commented 2 years ago

I have a PR to implement a scoped down version of this https://github.com/trinodb/trino/pull/13401