prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.05k stars 5.38k forks source link

Dynamic Filter for Hash Join #15758

Open SilenceJD opened 3 years ago

SilenceJD commented 3 years ago

Join is a very time-consuming operation, especially when there is a large amount of data. Currently, Presto 0.241+ supports local dynamic filtering for broadcast inner-joins, but does not have a dynamic filtering for hash inner-joins. In order to fill this gap, we propose dynamic filter for hash inner-joins.

Compared with Presto dynamic filter and dynamic filter for hash join we proposed, the main differences are as follows.

We have integrated tests for our DF and have a detailed test report. The test results show that we covered 2.30% of SQLs in Bytedance, and 54.62% of which achieved an average of 22.81% time improvement.

Our Design Doc:Presto Dynamic Filter For Hash Join Design Our Test Report:Test Report of Presto Dynamic Filter For Hash Join

highker commented 3 years ago

This is interesting. @SilenceJD, could you help to dump the design doc into a google doc? The link above seems to require registration.

SilenceJD commented 3 years ago

This is interesting. @SilenceJD, could you help to dump the design doc into a google doc? The link above seems to require registration.

Sorry, I've updated the link to Google Doc now.

iceted commented 3 years ago

is this similar to https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html ?

highker commented 3 years ago

Right, this is the same flow in https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html. However, they do have bloom filter to make it more efficient. Partition is a too loose granularity. We actually tried the partition pruning and it was working pretty bad. The queries we have a super fast with seconds or sub seconds latency. The roundtrip can kill the performance. @SilenceJD, wonder in your use cases, what the the latency ranges?

SilenceJD commented 3 years ago

Right, this is the same flow in https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html. However, they do have bloom filter to make it more efficient. Partition is a too loose granularity. We actually tried the partition pruning and it was working pretty bad. The queries we have a super fast with seconds or sub seconds latency. The roundtrip can kill the performance. @SilenceJD, wonder in your use cases, what the the latency ranges?

In our scenarios, some big queries with elapsed time of 60+ seconds. I think this is benefit for big inner join queries.

highker commented 3 years ago

@SilenceJD, could you send out a PR for this? I don't see any other approach that can gather build side digest effectively.

SilenceJD commented 3 years ago

@SilenceJD, could you send out a PR for this? I don't see any other approach that can gather build side digest effectively.

Sure, I will migrate the code to the latest version and send out.

SilenceJD commented 3 years ago

@SilenceJD, could you send out a PR for this? I don't see any other approach that can gather build side digest effectively

https://github.com/prestodb/presto/pull/15763 This is the PR of dynamic filter for hash join.

SilenceJD commented 3 years ago

@highker hi, I‘ve fixed the coding style already, would you mind give more suggestions for the design?

GithubZhitao commented 2 years ago

Is this feature accepted and merged to master ?