mpope9 / exor_filter

Erlang nif for xor_filter. 'Faster and Smaller Than Bloom and Cuckoo Filters'.
Apache License 2.0
31 stars 3 forks source link

Incremental value_list construction #12

Closed Vagabond closed 3 years ago

Vagabond commented 4 years ago

For constructing very large filters, it might be nice to have a way to allocate a 'value list' resource along the lines of

struct vl_resource {
   size_t size;
   size_t count;
   uint64_t values;
}

Then you could load keys into it a few at a time and once you were done you could populate the xor filter and free the resource. This would avoid storing all the unhashed/hashed keys in Erlang land and allow them to only be retained in the NIF (where they're most compact). If the value list gets too big you could use realloc() to grow it.

I don't have time to do it now, just trying to capture the thought.

mpope9 commented 4 years ago

This is a good idea. I'll look into this during the weekend.

mpope9 commented 4 years ago

Considering implementing this. An api like:

EmptyFilter = xor8:empty_filter(),
OneElemFilter = xor8:add(EmptyFilter, [1]]),
ThreeElemFilter = xor8:add(OneElemFilter, [2, 3]]),
Filter = xor8:finalize(ThreeElemFilter),

would make sense? Add some guarding to prevent a pre-finialized filter from working with contain / free. It's a little unweidly to require finalizing, but then we wouldn't need to keep track of the int list and the filter during its whole life cycle. Any thoughts on this?

mpope9 commented 4 years ago

Also, are you using this lib? I see some traffic recently and am curious on how it is working...

Vagabond commented 4 years ago

Yes, we're using it, although not to its full extent yet. It seems to work quite well.

I don't object to the API proposed above, it's pretty similar to how some of the compression and cryptographic libraries work when you want to use them in a streaming mode.

mpope9 commented 3 years ago

This has been added. Will try to get around to supporting deduplication of input soon. If dups are added then finalize will fail.

mpope9 commented 3 years ago

The library now auto-dedups input for the incremental api.