prataprc opened this issue 4 years ago
> To index 1 billion keys we may need a footprint of 50GB of memory.
Having a billion keys (64-bit integers) in memory would require 8GB. A billion keys is quite a lot.
An obvious solution is sharding: divide your data set into subsets and process each subset separately. This would give you several small filters, which is maybe more manageable.
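A minimal sketch of the sharding idea, using only the standard library. The `shard_of` routine and the shard count are illustrative assumptions, not part of the library; in practice you would build one Xor8 filter per shard.

```rust
// Sketch: partition 64-bit keys into shards so each shard's filter
// fits comfortably in memory. `NUM_SHARDS` and `shard_of` are
// illustrative choices, not the library's API.
const NUM_SHARDS: usize = 4;

fn shard_of(key: u64) -> usize {
    // Any reasonable mixing of the key works; modulo is fine here.
    (key % NUM_SHARDS as u64) as usize
}

fn shard_keys(keys: &[u64]) -> Vec<Vec<u64>> {
    let mut shards = vec![Vec::new(); NUM_SHARDS];
    for &k in keys {
        shards[shard_of(k)].push(k);
    }
    shards
}

fn main() {
    let keys: Vec<u64> = (0..1_000u64).collect();
    let shards = shard_keys(&keys);
    // One small filter would then be built per shard, e.g. (hypothetical):
    // let filters: Vec<Xor8> = shards.iter().map(|s| Xor8::from_keys(s)).collect();
    println!(
        "{} shards with sizes {:?}",
        shards.len(),
        shards.iter().map(|s| s.len()).collect::<Vec<_>>()
    );
}
```

At query time you hash the key to its shard first, then probe only that shard's filter, so lookups stay O(1).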
But what if you still want to construct a gigantic, out-of-memory, filter?
Then I think we want to use a sorting-based approach (because sorting can be done out-of-memory with relative ease).
How might this work? I think it is best expressed in code, but I can describe the idea. For each key k, you have three hash values h1(k), h2(k), h3(k). Sort your keys by h1(k). Then scan the sorted list and find every key whose h1(k) value occurs exactly once. Remove these keys from the list and push them onto a stack. Then sort what remains by h2(k). And so forth.
If you implement it with some care, it could use no more memory than the size of the input... I guess...
You can use radix sort, so the sort routine can be in linear time (though this might be at the expense of memory).
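The sort-and-peel idea above can be sketched as follows, in memory for clarity. The hash function `h` and the round limit are illustrative assumptions; the point is that each pass only needs a sort (which can be an external, disk-based sort) plus a linear scan.

```rust
// Illustrative hash: mix the key with a seed, reduce into `range` slots.
// A real implementation would use the library's hashing, not this.
fn h(key: u64, seed: u64, range: u64) -> u64 {
    let mut x = key.wrapping_mul(0x9E37_79B9_7F4A_7C15).wrapping_add(seed);
    x ^= x >> 33;
    x % range
}

/// Sketch of the sort-based peeling pass: sort remaining keys by the
/// current hash, peel off every key whose slot holds exactly one key,
/// push it on a stack, then cycle to the next hash function.
fn peel(mut keys: Vec<u64>, range: u64) -> Vec<u64> {
    let mut stack = Vec::new();
    let mut round = 0u64;
    while !keys.is_empty() && round < 64 {
        let seed = round % 3; // cycle through h1, h2, h3
        keys.sort_by_key(|&k| h(k, seed, range));
        let mut remaining = Vec::new();
        let mut i = 0;
        while i < keys.len() {
            // Find the run of keys sharing the same slot.
            let mut j = i;
            while j < keys.len() && h(keys[j], seed, range) == h(keys[i], seed, range) {
                j += 1;
            }
            if j - i == 1 {
                stack.push(keys[i]); // unique slot: peelable
            } else {
                remaining.extend_from_slice(&keys[i..j]);
            }
            i = j;
        }
        keys = remaining;
        round += 1;
    }
    stack
}

fn main() {
    let keys: Vec<u64> = (0..16).collect();
    let peeled = peel(keys.clone(), 24);
    println!("peeled {} of {} keys", peeled.len(), keys.len());
}
```

With roughly 1.23 slots per key, peeling is expected to succeed with high probability; the stack order then drives fingerprint assignment in reverse, as in the usual xor-filter construction.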
cc @thomasmueller
The Xor8 filter is an immutable data structure that builds a bitmap index for a given set of unique keys. The generated bitmap index has a footprint of about 9 bits per key. The sizing requirement for building the index can be understood with the following example.
Keys are u64 values, and we cannot incrementally add new keys to an existing bitmap index. So let us say we need to create a bitmap index for 1 million keys:
```
keys          as Vec<u64>                     = 8 MB
stack         of keys.len() as Vec<KeyIndex>  = 12 MB
q0            of block_length as Vec<KeyIndex> = 4.8 MB
q1            of block_length as Vec<KeyIndex> = 4.8 MB
q2            of block_length as Vec<KeyIndex> = 4.8 MB
sets0         of block_length as Vec<XorSet>   = 4.8 MB
sets1         of block_length as Vec<XorSet>   = 4.8 MB
sets2         of block_length as Vec<XorSet>   = 4.8 MB
finger_print                                   = 1.23 MB

block_length ~= (keys.len() * 1.23) / 3
```
In total, a ~50MB memory footprint is required to index 1 million keys.
To index 1 billion keys we may need a footprint of 50GB of memory.
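The arithmetic above can be checked with a short sketch. The per-element sizes (12 bytes for `KeyIndex` and `XorSet`) are assumptions inferred from the figures in the table, not taken from the library's source.

```rust
// Sketch: reproduce the sizing arithmetic from the table above.
// KeyIndex and XorSet sizes (12 bytes each) are assumptions inferred
// from the listed figures.
fn footprint_bytes(num_keys: u64) -> u64 {
    let key = 8u64;        // size of u64
    let key_index = 12u64; // assumed size of KeyIndex
    let xor_set = 12u64;   // assumed size of XorSet
    // 1.23 capacity factor, split across three blocks.
    let block_length = (num_keys as f64 * 1.23 / 3.0) as u64;

    let keys = num_keys * key;                 // keys: Vec<u64>
    let stack = num_keys * key_index;          // stack: Vec<KeyIndex>
    let queues = 3 * block_length * key_index; // q0, q1, q2
    let sets = 3 * block_length * xor_set;     // sets0, sets1, sets2
    let finger_print = 3 * block_length;       // one byte per slot
    keys + stack + queues + sets + finger_print
}

fn main() {
    let mb = footprint_bytes(1_000_000) as f64 / 1e6;
    println!("~{:.1} MB for 1 million keys", mb); // roughly 50 MB
}
```

Since every term scales linearly with `num_keys`, 1 billion keys comes out to roughly 1000x that, i.e. the ~50GB figure above.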