heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0
2.94k stars 446 forks source link

Bloom filter #707

Open gunlinan opened 2 years ago

gunlinan commented 2 years ago

Any support or plan to support of bloom filter for unique constraint, equal join and IN & LIKE operators on both native and foreign storages? (ref: https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/bloom_filter.cc)

cdessanti commented 2 years ago

Hi,

AFAIK we don't use any probabilistic index right now. We do use a bit vector for certain types of operators with IN, but it's deterministic.

Regards, Candido

gunlinan commented 2 years ago

Deterministic index is awesome.

When is the index built? Is it available for skipping fragments on storage?

cdessanti commented 2 years ago

When is the index built? Is it available for skipping fragments on storage?

We build a bitmap for IN operations, so in a query like this

select col1,col2,avg(col5) from tab1 where col3 in (select col1 from tab2 where col2 between 1 and 50)

a bitmap index with the range on min/max of col1 after the col2 filter on table tab2 is built as first step of the query, and the tab1 will be filtered testing the col3 values against the bitmap index.

To skip fragments on storage we are using the min/max value of the column that we store for each chunk; here probably the bloom filter would be a nice idea on equality cheks, but I am not sure how much would be effective, given the increased maintenance costs of the metadata

the part of the code that skip the fragment is here

https://github.com/omnisci/omniscidb/blob/d2f719934ebfd421c00d3bce83a2b0cda0ef802f/QueryEngine/Execute.cpp#L3771

I have to find out the part of the code that set the statistics of the chunks; never worked on that part of code

Regards, Candido