Closed clee704 closed 3 years ago
A new data skipping sketch type BloomFilterSketch is added.
BloomFilterSketch
import com.microsoft.hyperspace.Hyperspace import com.microsoft.hyperspace.index.dataskipping.DataSkippingIndexConfig import com.microsoft.hyperspace.index.dataskipping.sketches.{BloomFilterSketch, MinMaxSketch} spark.range(100).selectExpr("id as A", "id * 2 as B").write.parquet("X") val df = spark.read.parquet("X") val hs = Hyperspace() hs.createIndex(df, DataSkippingIndexConfig("myind", MinMaxSketch("A"), BloomFilterSketch("B", 0.01, 10))) val indexDataPath = hs.index("myind").select("indexLocation").collect().head.getString(0) spark.read.parquet(indexDataPath).show hs.explain(df.filter("B = 1"))
+-------------+-----------+-----------+--------------------------+ |_data_file_id|MinMax_A__0|MinMax_A__1|BloomFilter_B__0.01__10__0| +-------------+-----------+-----------+--------------------------+ | 10| 75| 82| {7, 47, [-8781728...| | 6| 41| 49| {7, 50, [96460795...| | 4| 58| 65| {7, 44, [57871781...| | 3| 91| 99| {7, 53, [-7349426...| | 9| 0| 7| {7, 44, [47615823...| | 11| 83| 90| {7, 49, [-6606656...| | 8| 50| 57| {7, 49, [11259074...| | 5| 66| 74| {7, 54, [40415990...| | 0| 25| 32| {7, 45, [49736543...| | 2| 16| 24| {7, 49, [-4128164...| | 1| 33| 40| {7, 45, [-8340072...| | 7| 8| 15| {7, 47, [-8916036...| +-------------+-----------+-----------+--------------------------+
============================================================= Plan with indexes: ============================================================= Filter (isnotnull(B#9L) AND (B#9L = 1)) +- ColumnarToRow +- FileScan Hyperspace(Type: DS, Name: myind, LogVersion: 1) [A#8L,B#9L] Batched: true, DataFilters: [isnotnull(B#9L), (B#9L = 1)], Format: Parquet, Location: DataSkippingFileIndex[file:/home/chungmin/Repos/hyperspace4/X], PartitionFilters: [], PushedFilters: [IsNotNull(B), EqualTo(B,1)], ReadSchema: struct<A:bigint,B:bigint> ============================================================= Plan without indexes: ============================================================= Filter (isnotnull(B#9L) AND (B#9L = 1)) +- ColumnarToRow +- FileScan parquet [A#8L,B#9L] Batched: true, DataFilters: [isnotnull(B#9L), (B#9L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/home/chungmin/Repos/hyperspace4/X], PartitionFilters: [], PushedFilters: [IsNotNull(B), EqualTo(B,1)], ReadSchema: struct<A:bigint,B:bigint> ============================================================= Indexes used: ============================================================= myind:file:/home/chungmin/Repos/hyperspace4/spark-warehouse/indexes/myind/v__=0
Yes, users can now create data skipping indexes with bloom filters.
Unit test
What is the context for this pull request?
What changes were proposed in this pull request?
A new data skipping sketch type
BloomFilterSketch
is added.Does this PR introduce any user-facing change?
Yes, users can now create data skipping indexes with bloom filters.
How was this patch tested?
Unit test