microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0

Apply JoinIndexRule only for SortMergeJoin #502

Closed sezruby closed 3 years ago

sezruby commented 3 years ago

What is the context for this pull request?

What changes were proposed in this pull request?

Add a condition to JoinIndexRule so that it is applied only when the join is expected to run as a SortMergeJoin.

A broadcast hash join (BHJ) doesn't need to shuffle all of the data, so applying a join index might cause a regression because of the bucketed read. Since a join query always has an "isnotnull" condition on the key column, the index will be applied by FilterIndexRule, which uses a non-bucketed read. With a non-bucketed read, Spark can parallelize the job depending on the dataset size, so a regression is less likely than with a bucketed read in the BHJ case.

The join type is checked using Spark's JoinSelection: https://github.com/apache/spark/blob/3ba57f5edc5594ee676249cd309b8f0d8248462e/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L182
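
To illustrate the idea, here is a minimal, self-contained sketch of the decision in plain Scala. It is not the code in this PR and it does not call Spark's JoinSelection. The object `JoinRuleSketch` and all of its members are hypothetical names, and the size check is a simplified assumption (a side counts as broadcastable when its estimated size is at most the threshold). Join hints and shuffled hash join are ignored. The default threshold mirrors Spark's `spark.sql.autoBroadcastJoinThreshold` default of 10 MB.

```scala
// Hypothetical, simplified model of "apply JoinIndexRule only for SortMergeJoin".
// None of these names exist in Hyperspace or Spark.
object JoinRuleSketch {
  sealed trait ExpectedJoin
  case object BroadcastHashJoin extends ExpectedJoin
  case object SortMergeJoin extends ExpectedJoin

  // Mirrors the default of spark.sql.autoBroadcastJoinThreshold (10 MB).
  val DefaultThreshold: Long = 10L * 1024 * 1024

  // Assumption: a side can be broadcast when its estimated size is
  // non-negative and within the threshold. A negative threshold
  // disables broadcasting entirely.
  def canBroadcast(sizeInBytes: BigInt, threshold: Long): Boolean =
    threshold >= 0 && sizeInBytes >= BigInt(0) && sizeInBytes <= BigInt(threshold)

  // If either side can be broadcast, assume a broadcast hash join;
  // otherwise assume a sort merge join.
  def expectedJoin(
      leftSize: BigInt,
      rightSize: BigInt,
      threshold: Long = DefaultThreshold): ExpectedJoin =
    if (canBroadcast(leftSize, threshold) || canBroadcast(rightSize, threshold)) {
      BroadcastHashJoin
    } else {
      SortMergeJoin
    }

  // The join index (bucketed read) is only worthwhile when the join would
  // otherwise shuffle both sides, i.e. for a sort merge join.
  def joinIndexRuleApplies(
      leftSize: BigInt,
      rightSize: BigInt,
      threshold: Long = DefaultThreshold): Boolean =
    expectedJoin(leftSize, rightSize, threshold) == SortMergeJoin
}
```

Under these assumptions, two large inputs lead to a sort merge join, so the join index is considered. If one side is small enough to broadcast, the join rule is skipped and the index can still be picked up through the filter on the key column.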

Does this PR introduce any user-facing change?

Yes. If a join is executed as a broadcast join, the index will be applied by FilterIndexRule rather than JoinIndexRule.

How was this patch tested?

Unit test