Open sadikovi opened 7 years ago
Another option is multilevel statistics, which would let us defer the expensive filter statistics until the very end, when the predicate has to be evaluated precisely.
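A minimal sketch of the multilevel idea, assuming two levels: cheap min/max bounds checked first, and an expensive per-chunk filter that is loaded only when the bounds cannot rule the value out. `ChunkStats`, `load_filter` and `might_contain` are hypothetical names used for illustration, and the exact set of values stands in for whatever expensive filter statistic would actually be used.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass
class ChunkStats:
    """Hypothetical two-level statistics for one chunk of data."""
    min_value: int
    max_value: int
    # Expensive statistic (here an exact value set as a stand-in),
    # produced lazily so it is only paid for when actually needed.
    load_filter: Callable[[], FrozenSet[int]]


def might_contain(stats: ChunkStats, value: int) -> bool:
    # Level 1: cheap min/max check, no extra I/O.
    if value < stats.min_value or value > stats.max_value:
        return False
    # Level 2: expensive filter, consulted only if level 1 passed.
    return value in stats.load_filter()
```

With this layout, a lookup that falls outside the min/max bounds never touches the expensive filter; only values inside the bounds trigger the second level.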
This approach has its own drawbacks. One is the dependency on the Parquet version, e.g. issues with statistics in older versions; another is reading data pages with skewed stats. For example, suppose you have 2 pages: one contains `1` and `1,000,000`, and the other contains `2`. If you index data pages, you will still have to scan the file for the query `id = 999`, because 999 falls within the first page's min/max range, even though there are only 3 values.
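The skewed-stats case can be reproduced with a small sketch. It assumes a page index that stores only min/max per page; `PageStats`, `build_page_index` and `select_pages` are hypothetical helpers for illustration, not part of the Spark or Parquet APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical index entry: min/max statistics for one data page."""
    page_id: int
    min_value: int
    max_value: int


def build_page_index(pages: List[List[int]]) -> List[PageStats]:
    # One entry per page, holding only the min and max of its values.
    return [PageStats(i, min(vals), max(vals)) for i, vals in enumerate(pages)]


def select_pages(index: List[PageStats], value: int) -> List[int]:
    # Keep every page whose [min, max] range could contain the value.
    return [p.page_id for p in index if p.min_value <= value <= p.max_value]


# Page 0 holds 1 and 1,000,000; page 1 holds 2.
pages = [[1, 1_000_000], [2]]
index = build_page_index(pages)

candidates = select_pages(index, 999)
matches = [v for pid in candidates for v in pages[pid] if v == 999]
print(candidates, matches)  # prints: [0] []
```

Page 0 cannot be pruned because its range covers 999, so it still has to be read even though no value matches. Only page 1 is skipped.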
Currently we use the Spark Parquet reader. This issue is about investigating whether we can extract data pages and index them, including each page's statistics. During a scan we would select only the pages that match the predicate and read data from them.
Questions: