Enable semi-join optimization for left-deep trees through bloom filters

saketj commented 8 years ago

This PR adds support for semi-join optimization for left-deep trees through the use of bloom filters. The motivation behind this PR is that often the bottom-most join in a left-deep tree returns maximum number of tuples from the fact table, a majority of which are then discarded by joins up the tree. If somehow, the selectivity information from up the tree could be passed down to the bottom-most operators, we could avoid the cost of potentially passing and storing huge number of unnecessary tuples during query execution.

Bloom filters are lightweight data structures that can help us push down this selectivity information down the operator tree. Therefore, the key idea is to delay the processing of the bottom-most probe operator on the fact table, until bloom filters on all the build operators on all the dimension tables are built. Then pass the bloom filters sideways to this probe operator, which can use these bloom filters to weed out the unnecessary tuples.

This feature is controlled by GFLAGs and is turned off by default. Pass the --optimize_joins flag from the command-line to explicitly enable this optimization.

On a typical left-deep join query like SSB Query 4, this optimization reduces the query execution time by almost 40%.

saketj commented 8 years ago

@hbdeshmukh Thanks Harshad for the comments. I have addressed them now. Sure, I will be there around in the afternoon at the department. We can discuss the optimization function.

hbdeshmukh commented 8 years ago

I and @saketj had a discussion. Saket will improve the documentation and variable names for better clarity. We also had a discussion on how to write tests for this code.

pateljm commented 8 years ago

Sounds good @hbdeshmukh. Thanks!

Thanks @saketj -- This is an awesome feature!

zuyu commented 8 years ago

@saketj I was wondering, as an alternative to add a new execution_heuristics_, we just treat this as a new method called optimizeHashJoin in the existing ExecutionGeneerator.

saketj commented 8 years ago

@hbdeshmukh and I had discussion on how we can reduce lock contention between various worker threads during bloom filter insertion. An improved locking mechanism is now implemented in the latest commit.

saketj commented 8 years ago

@zuyu Thanks Zuyu for the comments. I am working on incorporating them into this PR. As for ExecutionHeuristics class, I can foresee it as something that will grow to incorporate many more complex optimization logic, along with its own data structures. For example, the logic for identification of left-deep trees is currently simple and ignores some complicated scenarios. This will be work of subsequent PRs, I guess. Also, setting bloom filter properties based on the relation statistics is another component that has been introduced.

saketj commented 8 years ago

@zuyu @hbdeshmukh I have addressed all of the comments now. I have also added the unit tests for this PR at query_optimizer/tests/ExecutionHeuristics_unittest.cpp. Please let me know if this all looks good and if we can merge. There is one more incremental feature that builds on top of this PR and it adds the ability to push the bloom filters even further down the tree to a selection operator preceding a probe. I have the changes in my local copy and I am planning to immediately open that PR, once this gets merged. Thanks.

hbdeshmukh commented 8 years ago

FYI, I cancelled the Travis build.

hbdeshmukh commented 8 years ago

LGTM, merging.

vmware-archive / quickstep

Enable semi-join optimization for left-deep trees through bloom filters #195