Investigate different implementation of ParquetReader

lightcopy / parquet-index

Spark SQL index for Parquet tables

Apache License 2.0

Investigate different implementation of ParquetReader #63

Open sadikovi opened 7 years ago

sadikovi commented 7 years ago

Currently we are using the Spark Parquet reader. This issue is about investigating whether we can extract data pages and index them, including each page's statistics. During a scan we would select only those pages whose statistics match the predicate and read data from them.
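A minimal sketch of the idea, assuming pages are modelled as plain lists of integers and the index stores only min/max per page. `PageStats`, `build_page_index`, `pages_for_equality` and `scan` are hypothetical names for illustration, not part of parquet-index or Parquet:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PageStats:
    """Hypothetical index entry: page position plus min/max of its values."""
    page_id: int
    min_value: int
    max_value: int


def build_page_index(pages: Sequence[Sequence[int]]) -> List[PageStats]:
    """Compute min/max statistics for every non-empty page."""
    return [
        PageStats(page_id=i, min_value=min(values), max_value=max(values))
        for i, values in enumerate(pages)
        if values
    ]


def pages_for_equality(index: List[PageStats], target: int) -> List[int]:
    """Return ids of pages whose [min, max] range may contain target."""
    return [s.page_id for s in index if s.min_value <= target <= s.max_value]


def scan(pages: Sequence[Sequence[int]], index: List[PageStats], target: int) -> List[int]:
    """Read only the candidate pages, then evaluate the predicate exactly."""
    candidates = pages_for_equality(index, target)
    return [v for pid in candidates for v in pages[pid] if v == target]


# Toy data standing in for the data pages of one column chunk.
pages = [[1, 2, 3], [10, 11, 12], [20, 25, 30]]
index = build_page_index(pages)

print(pages_for_equality(index, 11))  # [1]
print(scan(pages, index, 11))         # [11]
```

Only page 1 is read for `id = 11`; the statistics rule out the other two pages without touching their data.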

Questions:

If the file is compressed, is it worth doing this kind of read?
How do we reconstruct the full record with this approach?

sadikovi commented 7 years ago

Another take is multilevel statistics. This would allow us to defer expensive filter statistics until the very end, when we have to evaluate the predicate precisely.
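One way to read this, as a rough sketch: order the statistics from cheapest to most expensive and stop at the first level that rules a page out. The level names, the dict layout and the `distinct` set (a stand-in for a costlier statistic such as a Bloom filter) are assumptions made for illustration only:

```python
from typing import Callable, Dict, List, Tuple

# Each level is (name, check); check returns False when the page can be skipped.
Level = Tuple[str, Callable[[Dict, int], bool]]


def min_max_check(stats: Dict, target: int) -> bool:
    """Cheap level: compare the target against the page's min/max."""
    return stats["min"] <= target <= stats["max"]


def distinct_values_check(stats: Dict, target: int) -> bool:
    """Expensive level (hypothetical): look the target up in a set of distinct values."""
    return target in stats["distinct"]


# Levels are ordered from cheapest to most expensive.
LEVELS: List[Level] = [
    ("min_max", min_max_check),
    ("distinct", distinct_values_check),
]


def might_match(stats: Dict, target: int, evaluated: List[str]) -> bool:
    """Run levels in order; record which ones were actually evaluated."""
    for name, check in LEVELS:
        evaluated.append(name)
        if not check(stats, target):
            return False
    return True


page_stats = {"min": 1, "max": 1_000_000, "distinct": frozenset({1, 1_000_000})}

evaluated: List[str] = []
print(might_match(page_stats, 2_000_000, evaluated), evaluated)  # False ['min_max']

evaluated = []
print(might_match(page_stats, 999, evaluated), evaluated)  # False ['min_max', 'distinct']
```

The expensive level runs only when the cheap one cannot decide, which is the point of pushing it to the end.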

sadikovi commented 7 years ago

This approach has its own drawbacks. One is the dependency on the Parquet version, e.g. issues with statistics in older versions. Another is reading data pages with skewed stats. For example, say you have 2 pages: one contains 1 and 1,000,000, and the other contains 2. If you index data pages, you will still have to scan the file for the query id = 999, because 999 falls inside the first page's [1, 1,000,000] range, even though there are only 3 values and none of them matches.
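The skewed-stats example can be reproduced with a few lines, again assuming pages are plain lists and the index holds only min/max per page:

```python
# Two pages as in the example: one with 1 and 1,000,000, the other with 2.
pages = [[1, 1_000_000], [2]]

# Page-level index: (min, max) per page.
stats = [(min(p), max(p)) for p in pages]

target = 999

# Pages that min/max statistics cannot rule out for id = target.
candidates = [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]

# Exact evaluation of the predicate on the candidate pages.
matches = [v for i in candidates for v in pages[i] if v == target]

print(stats)       # [(1, 1000000), (2, 2)]
print(candidates)  # [0]
print(matches)     # []
```

Page 0 has to be read even though it contains no match, because its min/max range is too wide to exclude 999.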