dai-chen opened this issue 10 months ago
Next we will dive into Option 3: query-time on-demand build.
Following the item proposed above, here is an example that illustrates the idea:
TODO
TODO
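While the example above is still marked TODO, a rough sketch of what a query-time on-demand build might look like is shown below. This is purely illustrative and not actual Flint code: the table name `logs`, the column `status`, and the predicate value are assumptions, and the sketch uses plain Spark APIs only.

```scala
// Purely illustrative sketch of a query-time on-demand skipping build.
// Table "logs", column "status", and value 500 are assumptions, not from this issue.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name, max, min}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// 1. At query time, derive per-file min/max for the filtered column on demand
//    (no pre-built index, no streaming job keeping it fresh).
val fileStats = spark.read.table("logs")
  .groupBy(input_file_name().as("file"))
  .agg(min(col("status")).as("min_status"), max(col("status")).as("max_status"))

// 2. Prune: keep only files whose [min, max] range can contain the predicate value.
val candidateFiles = fileStats
  .where(col("min_status") <= 500 && col("max_status") >= 500)
  .select("file")
  .as[String]
  .collect()

// 3. Scan only the surviving files for the actual query.
val result = spark.read.parquet(candidateFiles: _*).where(col("status") === 500)
result.show()
```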
Delta tables collect column stats automatically. However, Delta only collects min-max for numeric, date, and string columns. Probably because it stores data as Parquet (which already uses min-max, dictionary encoding, and bloom filters), Delta only aggregates file-level min-max up to the Delta table level.
collectStats(MIN, statCollectionPhysicalSchema) {
  // Truncate string min values as necessary
  case (c, SkippingEligibleDataType(StringType), true) =>
    substring(min(c), 0, stringPrefix)

  // Collect all numeric min values
  case (c, SkippingEligibleDataType(_), true) =>
    min(c)
},
object SkippingEligibleDataType {
  // Call this directly, e.g. `SkippingEligibleDataType(dataType)`
  def apply(dataType: DataType): Boolean = dataType match {
    case _: NumericType | DateType | TimestampType | StringType => true
    case _ => false
  }
}
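For context, the per-file stats that the excerpt above computes end up recorded as a JSON string on each `add` action in the Delta transaction log, so they can also be inspected after the fact. The sketch below reads the commit JSON files directly; it assumes a SparkSession named `spark`, and the table path and column name `x` are made up (checkpoint Parquet files are ignored for simplicity):

```scala
// Sketch: inspecting the per-file min/max stats Delta has already collected, by reading
// the transaction log commit files directly. Path and column name "x" are assumptions.
import org.apache.spark.sql.functions.{col, get_json_object}

val log = spark.read.json("s3://my-bucket/my_delta_table/_delta_log/*.json")

// Each "add" action carries a stats JSON string with numRecords, minValues and maxValues.
val fileStats = log
  .where(col("add").isNotNull)
  .select(
    col("add.path").as("file"),
    get_json_object(col("add.stats"), "$.minValues.x").as("min_x"),
    get_json_object(col("add.stats"), "$.maxValues.x").as("max_x"))

fileStats.show(truncate = false)
```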
Hyperspace also provides an analysis utility to help users estimate the effectiveness of Z-Ordering before index creation.
scala> import com.microsoft.hyperspace.util.MinMaxAnalysisUtil
import com.microsoft.hyperspace.util.MinMaxAnalysisUtil
scala> println(MinMaxAnalysisUtil.analyze(df, Seq("A", "B"), format = "text"))
Min/Max analysis on A
< Number of files (%) >
+--------------------------------------------------+
100% | |
| |
| |
| |
75% | |
| |
| |
| |
| |
50% | |
| |
| |
| |
| |
25% | |
| |
| |
| |
|**************************************************|
0% |**************************************************|
+--------------------------------------------------+
Min <----- A value -----> Max
min(A): 0
max(A): 99
Total num of files: 12
Total byte size of files: 11683
Max. num of files for a point lookup: 1 (8.33%)
Estimated average num of files for a point lookup: 1.00 (8.33%)
Max. bytes to read for a point lookup: 982 (8.41%)
...
X-axis: the range groups of the column values. Y-axis: the percentage of files that would have to be read to look up a value, based on each file's minimum and maximum. A lower percentage therefore means a better distribution, since more files can be skipped.
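A back-of-the-envelope version of this estimate can be done without Hyperspace using plain Spark: compute each file's min/max for the column, then count how many files' ranges would have to be read for a given point lookup. This is a sketch only; it assumes a SparkSession named `spark`, and table `logs`, column `A`, and the lookup value are placeholders:

```scala
// Sketch: rough equivalent of the min/max analysis above, using plain Spark.
// Table "logs", column "A", and lookup value 42 are placeholders.
import org.apache.spark.sql.functions.{col, input_file_name, max, min}

val perFile = spark.read.table("logs")
  .groupBy(input_file_name().as("file"))
  .agg(min(col("A")).as("min_a"), max(col("A")).as("max_a"))
  .cache()

val totalFiles = perFile.count()

// A point lookup for value v must read every file whose [min_a, max_a] range contains v.
def filesForPointLookup(v: Int): Long =
  perFile.where(col("min_a") <= v && col("max_a") >= v).count()

val v = 42
val hits = filesForPointLookup(v)
println(f"Point lookup on A=$v reads $hits of $totalFiles files (${100.0 * hits / totalFiles}%.2f%%)")
```

A consistently high percentage for typical lookup values would suggest the column benefits from sorting or Z-Ordering before a skipping index is built.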
Is your feature request related to a problem?
When using a Flint skipping index, the user first needs to create a Spark table and then decide which skipping data structure to use for each column. Afterwards, the freshness of the skipping index is maintained by a long-running Spark streaming job.
As a user, the pain points include:
What solution would you like?
Proposed ideas are listed below; each item needs a PoC: