An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Investigate use of Join Index Rule V2 when #buckets on indexes on both sides of join are different #237
Open
apoorvedave1 opened 3 years ago
Describe the issue
Problem 1
Discussion thread: Thanks @imback82 for pointing this out: https://github.com/microsoft/hyperspace/pull/124#discussion_r514548371
Result:
Neither side is able to utilize bucketing.
Now if we set `spark.sql.shuffle.partitions = 4`, we can use bucketing on the right side.
Problem
We need to decide whether or not to add the check `spark.sql.shuffle.partitions == numBuckets` while picking an index, based on the following criteria:
- `t1.buckets == t2.buckets`: Spark can eliminate the shuffle irrespective of `spark.sql.shuffle.partitions`.
- `t1.buckets != t2.buckets && spark.sql.shuffle.partitions == t2.buckets`: Spark can eliminate the t2-side shuffle.
Problem 2
https://github.com/microsoft/hyperspace/pull/124#discussion_r515026784
Do we decide to NOT use index for Binary nodes where it doesn't make sense?
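To make the bucket-count criteria above easier to discuss, here is a minimal sketch of them as a pure function. It is only a model of the two cases listed in this issue, not Spark's planner or Hyperspace's rule code. The names `ShuffleOutcome` and `shuffle_outcome` are hypothetical, and the mirrored case (`spark.sql.shuffle.partitions == t1.buckets` avoiding the t1-side shuffle) is an assumption made by symmetry.

```python
from typing import NamedTuple


class ShuffleOutcome(NamedTuple):
    """Which join sides can skip the shuffle (hypothetical helper type)."""
    left_shuffle_avoided: bool
    right_shuffle_avoided: bool


def shuffle_outcome(t1_buckets: int, t2_buckets: int,
                    shuffle_partitions: int) -> ShuffleOutcome:
    """Model the criteria from this issue for a join of two bucketed indexes.

    This mirrors the cases described in the issue text only; it does not
    call into Spark and may not match every Spark version.
    """
    if t1_buckets == t2_buckets:
        # Equal bucket counts: no shuffle on either side, irrespective of
        # spark.sql.shuffle.partitions.
        return ShuffleOutcome(True, True)
    # Unequal bucket counts: a side keeps its bucketing only if its bucket
    # count equals spark.sql.shuffle.partitions. The t1 branch is assumed
    # by symmetry with the t2 case stated in the issue.
    return ShuffleOutcome(
        left_shuffle_avoided=(t1_buckets == shuffle_partitions),
        right_shuffle_avoided=(t2_buckets == shuffle_partitions),
    )


# Illustrative bucket counts (8 and 4 are assumptions, not from the issue).
default_case = shuffle_outcome(8, 4, shuffle_partitions=200)
tuned_case = shuffle_outcome(8, 4, shuffle_partitions=4)
```

With these illustrative numbers, `default_case` reports that neither side avoids the shuffle, while `tuned_case` reports that only the right side does, matching the behaviour described under Problem 1. A rule could use such a check to skip an index on a join side where no shuffle would be saved, which is the question raised under Problem 2.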
To Reproduce
Expected behavior
Environment