microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0

Data Skipping Index Part 3-2: Rule #482

Closed: clee704 closed this 3 years ago

clee704 commented 3 years ago

What is the context for this pull request?

What changes were proposed in this pull request?

Implement the data skipping index application rule.

Does this PR introduce any user-facing change?

Yes. Users can now create data skipping indexes, which are applied to filter queries. For example:

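import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.dataskipping.DataSkippingIndexConfig
import com.microsoft.hyperspace.index.dataskipping.sketches.MinMaxSketch

// Write a small Parquet dataset with a single column A, then read it back.
spark.range(100).toDF("A").write.parquet("X")
val df = spark.read.parquet("X")

// Create a data skipping index named "myind" with a min/max sketch on column A.
val hs = Hyperspace()
hs.createIndex(df, DataSkippingIndexConfig("myind", MinMaxSketch("A")))

// Compare the plans for a filter query with and without the index.
hs.explain(df.filter("A = 1"))
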
=============================================================
Plan with indexes:
=============================================================
Filter (isnotnull(A#271L) AND (A#271L = 1))
+- ColumnarToRow
   +- FileScan Hyperspace(Type: DS, Name: myind, LogVersion: 1) [A#271L] Batched: true, DataFilters: [isnotnull(A#271L), (A#271L = 1)], Format: Parquet, Location: DataSkippingFileIndex[file:/home/chungmin/Repos/spark3.1/X], PartitionFilters: [], PushedFilters: [IsNotNull(A), EqualTo(A,1)], ReadSchema: struct<A:bigint>

=============================================================
Plan without indexes:
=============================================================
Filter (isnotnull(A#271L) AND (A#271L = 1))
+- ColumnarToRow
   +- FileScan parquet [A#271L] Batched: true, DataFilters: [isnotnull(A#271L), (A#271L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/home/chungmin/Repos/spark3.1/X], PartitionFilters: [], PushedFilters: [IsNotNull(A), EqualTo(A,1)], ReadSchema: struct<A:bigint>

=============================================================
Indexes used:
=============================================================
myind:file:/home/chungmin/Repos/spark3.1/spark-warehouse/indexes/myind/v__=0
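
For readers new to the idea, the effect of a min/max sketch on a filter such as A = 1 can be illustrated in plain Scala. This is a minimal sketch of the general technique, not the rule implemented in this PR: `FileStats`, `mayContain`, and `filesToScan` are hypothetical names, and the per-file statistics are made up for illustration.

```scala
// Illustrative only: this is NOT Hyperspace's implementation.
// FileStats, mayContain, and filesToScan are hypothetical names.
final case class FileStats(path: String, min: Long, max: Long)

object MinMaxSkipping {
  // A file can hold rows with A = value only if value lies within [min, max].
  def mayContain(stats: FileStats, value: Long): Boolean =
    stats.min <= value && value <= stats.max

  // Keep only the files whose value range could satisfy the equality predicate.
  def filesToScan(index: Seq[FileStats], value: Long): Seq[String] =
    index.filter(mayContain(_, value)).map(_.path)
}

object MinMaxSkippingExample extends App {
  // Made-up statistics for three files that together cover 0 to 99.
  val index = Seq(
    FileStats("part-0", 0L, 33L),
    FileStats("part-1", 34L, 66L),
    FileStats("part-2", 67L, 99L))

  // Files that still need to be read for the predicate A = 1.
  println(MinMaxSkipping.filesToScan(index, 1L))
}
```

With these assumed statistics, only `part-0` survives the check for A = 1, so the other two files would never be read. The check is conservative: a file whose range contains the value is still scanned even if no row in it actually matches.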

How was this patch tested?

Unit tests.

sezruby commented 3 years ago

Could you split the PR? e.g. part 3-1: utils, part 3-2: apply?

clee704 commented 3 years ago

Thanks for the detailed review!