Data Skipping Index Part 3-2: Rule

clee704 commented 3 years ago

What is the context for this pull request?

Tracking Issue: #441
Parent Issue: N/A
Dependencies: #491

What changes were proposed in this pull request?

Implement the data skipping index application rule.

Does this PR introduce any user-facing change?

Yes. Users can create data skipping indexes, and the new rule applies them to filter queries. For example:

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.dataskipping.DataSkippingIndexConfig
import com.microsoft.hyperspace.index.dataskipping.sketches.MinMaxSketch

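// Write a small Parquet dataset with one column A (values 0 to 99), then read it back.
spark.range(100).toDF("A").write.parquet("X")
val df = spark.read.parquet("X")
// Create a data skipping index named "myind" with a min/max sketch on column A.
val hs = Hyperspace()
hs.createIndex(df, DataSkippingIndexConfig("myind", MinMaxSketch("A")))
// Print the query plans with and without the index, plus the indexes used.
hs.explain(df.filter("A = 1"))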

=============================================================
Plan with indexes:
=============================================================
Filter (isnotnull(A#271L) AND (A#271L = 1))
+- ColumnarToRow
   +- FileScan Hyperspace(Type: DS, Name: myind, LogVersion: 1) [A#271L] Batched: true, DataFilters: [isnotnull(A#271L), (A#271L = 1)], Format: Parquet, Location: DataSkippingFileIndex[file:/home/chungmin/Repos/spark3.1/X], PartitionFilters: [], PushedFilters: [IsNotNull(A), EqualTo(A,1)], ReadSchema: struct<A:bigint>

=============================================================
Plan without indexes:
=============================================================
Filter (isnotnull(A#271L) AND (A#271L = 1))
+- ColumnarToRow
   +- FileScan parquet [A#271L] Batched: true, DataFilters: [isnotnull(A#271L), (A#271L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/home/chungmin/Repos/spark3.1/X], PartitionFilters: [], PushedFilters: [IsNotNull(A), EqualTo(A,1)], ReadSchema: struct<A:bigint>

=============================================================
Indexes used:
=============================================================
myind:file:/home/chungmin/Repos/spark3.1/spark-warehouse/indexes/myind/v__=0
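
For reviewers unfamiliar with the idea, here is a minimal, self-contained sketch of what a min/max sketch lets a filter such as `A = 1` do: keep a per-file value range and skip files whose range cannot contain the target. This is an illustration only, not the code in this PR. `FileMinMax`, `MinMaxSkipping`, and the `part-N.parquet` file names are hypothetical, and the real rule works on Spark plans and index data rather than in-memory sequences.

```scala
// Hypothetical per-file summary; this is NOT a Hyperspace class.
final case class FileMinMax(path: String, min: Long, max: Long)

object MinMaxSkipping {
  // Build a min/max summary for the values of one column in one file.
  def summarize(path: String, values: Seq[Long]): FileMinMax = {
    require(values.nonEmpty, "cannot summarize an empty file")
    FileMinMax(path, values.min, values.max)
  }

  // Keep only the files whose [min, max] range could contain `target`.
  // Files outside the range are skipped without being read.
  def candidates(index: Seq[FileMinMax], target: Long): Seq[FileMinMax] =
    index.filter(f => f.min <= target && target <= f.max)
}

object MinMaxSkippingDemo {
  def main(args: Array[String]): Unit = {
    // Four illustrative files, each holding 25 consecutive values of A from 0 to 99.
    val index = (0L until 100L).grouped(25).zipWithIndex.map {
      case (values, i) => MinMaxSkipping.summarize(s"part-${i}.parquet", values)
    }.toList

    // Only the first file can contain A = 1, so the other three are skipped.
    val kept = MinMaxSkipping.candidates(index, target = 1L)
    println(kept.map(_.path).mkString(", ")) // prints "part-0.parquet"
  }
}
```

A range check like this can only rule files out. A file that passes may still contain no matching rows, which is consistent with the Filter node remaining on top of the FileScan in the plan above.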

How was this patch tested?

Unit tests.

sezruby commented 3 years ago

Could you split the PR, e.g., part 3-1: utils and part 3-2: apply?

clee704 commented 3 years ago

Thanks for the detailed review!

microsoft / hyperspace

Data Skipping Index Part 3-2: Rule #482
