prataprc / xorfilter

Rust library implementing xor-filters
Apache License 2.0
137 stars 18 forks source link

Handle DGM scenario for Xor8 filter. #9

Open prataprc opened 4 years ago

prataprc commented 4 years ago

Xor8 filter is immutable data-structure that generates a bitmap index for given set of unique keys. The generated bitmap index has a footprint of ~ 9 bits for each key. And the sizing requirement for building the index can be understood with following example.

Keys are u64 types and we cannot incrementally add new keys to existing bitmap index. So let us say we need to create a bitmap index for 1 million keys:

keys as Vec<u64> = 8MB stack of keys.len() as Vec<KeyIndex> = 12MB q0 of block_length as Vec<KeyIndex> = 4.8MB q1 of block_length as Vec<KeyIndex> = 4.8MB q2 of block_length as Vec<KeyIndex> = 4.8MB sets0 of block_length as Vec<XorSet> = 4.8MB sets1 of block_length as Vec<XorSet> = 4.8MB sets2 of block_length as Vec<XorSet> = 4.8MB finger_print = 1.23 MB

block_length ~= (keys.len() * 1.23) / 3

A total of 50MB memory footprint is required for indexing 1 Million keys.

To index 1 billion keys we may need a footprint of 50GB of memory.

lemire commented 4 years ago

To index 1 billion keys we may need a footprint of 50GB of memory.

Having a billion keys (64-bit integers) in memory would require 8GB. A billion keys is a quite a lot.

An obvious solution is sharding... divide your data set into subsets, and process each set separately. This would give you several small filters... which is maybe more manageable.

But what if you still want to construct a gigantic, out-of-memory, filter?

Then I think we want to use a sorting-based approach (because sorting can be done out-of-memory with relative ease).

How might this work? I think it is best expressed in code, but I think I can describe the idea. For each key, you have h1(k), h2(k), h3(k). Sort your keys by h1(k). Then go through the sorted list and find all occurrences when the h1(k) value occurs once. Remove these keys from the list and add them to a stack. Then sort what remains with respect to h2(k). And so forth.

If you implement it with some care, it could use no more memory than the size of the input... I guess...

You can use radix sort, so the sort routine can be in linear time (though this might be at the expense of memory).

cc @thomasmueller