A broadcast hash join (BHJ) does not need to shuffle all of the data, so applying a join index might cause a regression because of the bucketed read.
Since a join query always has an "isnotnull" condition on the key column, the index can still be applied by FilterIndexRule, which uses a non-bucketed read.
With a non-bucketed read, Spark can parallelize the job depending on the dataset size, so a regression is less likely than with a bucketed read in the BHJ case.
What is the context for this pull request?
What changes were proposed in this pull request?
Add a condition to JoinIndexRule so that it is not applied when the join would be performed as a broadcast join.
Check the join type using JoinSelection - https://github.com/apache/spark/blob/3ba57f5edc5594ee676249cd309b8f0d8248462e/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L182
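The check described above can be illustrated with a minimal sketch. This is a simplified model in plain Python, not the code in this PR: `JoinSide`, `can_broadcast`, and `should_apply_join_index` are hypothetical names, and only the size-based part of the decision is modeled (Spark's JoinSelection also considers join hints and the join type). The 10 MB value is Spark's default for `spark.sql.autoBroadcastJoinThreshold`.

```python
from dataclasses import dataclass

# Spark's default for spark.sql.autoBroadcastJoinThreshold (10 MB).
# Setting the threshold to -1 disables size-based broadcast.
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


@dataclass
class JoinSide:
    """Hypothetical stand-in for one side of a join and its estimated size."""
    name: str
    size_in_bytes: int


def can_broadcast(side: JoinSide, threshold: int = DEFAULT_BROADCAST_THRESHOLD) -> bool:
    # Size-based check only: the estimate must be non-negative and
    # no larger than the threshold. Hints and join type are ignored here.
    return 0 <= side.size_in_bytes <= threshold


def should_apply_join_index(
    left: JoinSide,
    right: JoinSide,
    threshold: int = DEFAULT_BROADCAST_THRESHOLD,
) -> bool:
    # Assumption for this sketch: the join index is skipped whenever either
    # side is small enough to broadcast, leaving the filter rule to apply.
    return not (can_broadcast(left, threshold) or can_broadcast(right, threshold))
```

For example, under these assumptions `should_apply_join_index(JoinSide("fact", 50 * 1024 * 1024), JoinSide("dim", 1024))` is false, because the small side could be broadcast.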
Does this PR introduce any user-facing change?
Yes. If a join is performed as a broadcast join, the index will be applied by FilterIndexRule.
How was this patch tested?
Unit test