Open gunlinan opened 2 years ago
Hi,
AFAIK we don't use any probabilistic index right now. We do use a bit vector for certain types of operators with IN, but it's deterministic.
Regards, Candido
Deterministic index is awesome.
When is the index built? Is it available for skipping fragments on storage?
When is the index built? Is it available for skipping fragments on storage?
We build a bitmap for IN operations, so in a query like this
select col1,col2,avg(col5) from tab1 where col3 in (select col1 from tab2 where col2 between 1 and 50)
a bitmap index with the range on min/max of col1 after the col2 filter on table tab2 is built as first step of the query, and the tab1 will be filtered testing the col3 values against the bitmap index.
To skip fragments on storage we are using the min/max value of the column that we store for each chunk; here probably the bloom filter would be a nice idea on equality cheks, but I am not sure how much would be effective, given the increased maintenance costs of the metadata
the part of the code that skip the fragment is here
I have to find out the part of the code that set the statistics of the chunks; never worked on that part of code
Regards, Candido
Any support or plan to support of bloom filter for unique constraint, equal join and IN & LIKE operators on both native and foreign storages? (ref: https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/bloom_filter.cc)