microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0
423 stars 115 forks source link

Data Skipping Index Part 4: BloomFilterSketch #483

Closed clee704 closed 3 years ago

clee704 commented 3 years ago

What is the context for this pull request?

What changes were proposed in this pull request?

A new data skipping sketch type BloomFilterSketch is added.

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.dataskipping.DataSkippingIndexConfig
import com.microsoft.hyperspace.index.dataskipping.sketches.{BloomFilterSketch, MinMaxSketch}

spark.range(100).selectExpr("id as A", "id * 2 as B").write.parquet("X")
val df = spark.read.parquet("X")
val hs = Hyperspace()
hs.createIndex(df, DataSkippingIndexConfig("myind", MinMaxSketch("A"), BloomFilterSketch("B", 0.01, 10)))
val indexDataPath = hs.index("myind").select("indexLocation").collect().head.getString(0)
spark.read.parquet(indexDataPath).show
hs.explain(df.filter("B = 1"))
+-------------+-----------+-----------+--------------------------+
|_data_file_id|MinMax_A__0|MinMax_A__1|BloomFilter_B__0.01__10__0|
+-------------+-----------+-----------+--------------------------+
|           10|         75|         82|      {7, 47, [-8781728...|
|            6|         41|         49|      {7, 50, [96460795...|
|            4|         58|         65|      {7, 44, [57871781...|
|            3|         91|         99|      {7, 53, [-7349426...|
|            9|          0|          7|      {7, 44, [47615823...|
|           11|         83|         90|      {7, 49, [-6606656...|
|            8|         50|         57|      {7, 49, [11259074...|
|            5|         66|         74|      {7, 54, [40415990...|
|            0|         25|         32|      {7, 45, [49736543...|
|            2|         16|         24|      {7, 49, [-4128164...|
|            1|         33|         40|      {7, 45, [-8340072...|
|            7|          8|         15|      {7, 47, [-8916036...|
+-------------+-----------+-----------+--------------------------+
=============================================================
Plan with indexes:
=============================================================
Filter (isnotnull(B#9L) AND (B#9L = 1))
+- ColumnarToRow
   +- FileScan Hyperspace(Type: DS, Name: myind, LogVersion: 1) [A#8L,B#9L] Batched: true, DataFilters: [isnotnull(B#9L), (B#9L = 1)], Format: Parquet, Location: DataSkippingFileIndex[file:/home/chungmin/Repos/hyperspace4/X], PartitionFilters: [], PushedFilters: [IsNotNull(B), EqualTo(B,1)], ReadSchema: struct<A:bigint,B:bigint>

=============================================================
Plan without indexes:
=============================================================
Filter (isnotnull(B#9L) AND (B#9L = 1))
+- ColumnarToRow
   +- FileScan parquet [A#8L,B#9L] Batched: true, DataFilters: [isnotnull(B#9L), (B#9L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/home/chungmin/Repos/hyperspace4/X], PartitionFilters: [], PushedFilters: [IsNotNull(B), EqualTo(B,1)], ReadSchema: struct<A:bigint,B:bigint>

=============================================================
Indexes used:
=============================================================
myind:file:/home/chungmin/Repos/hyperspace4/spark-warehouse/indexes/myind/v__=0

Does this PR introduce any user-facing change?

Yes, users can now create data skipping indexes with bloom filters.

How was this patch tested?

Unit test