kristoff-it / redis-cuckoofilter

Hashing-function agnostic Cuckoo filters for Redis
MIT License

CF.SIZEFOR #4

Open shahasim opened 4 years ago

shahasim commented 4 years ago

Hi,

I wanted to understand the commands associated with the cuckoo filter; apologies, some of these questions might be very basic.

I am trying to create a filter that would hold 1 billion elements with an FP size of 4 bytes.

I used the following commands to estimate the size and create the filter:

CF.SIZEFOR 1000000000 4 EXACT
4G
CF.INIT test 4G
OK (2.94s)
CF.CAPACITY test
(error) ERR bad size

I believe CF.CAPACITY should return 1 billion; however, I get a "bad size" error.

Similarly, when I start loading values into my test filter, I get "ERR too full" after just 12,000 to 16,000 values. So I would like to know whether there is a limitation on the size of the hash, and whether the size of the hashed string should be factored in. I am using a 64-bit FNV hash to hash strings into numbers.
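For reference, here is roughly how I produce the hashes before calling CF.ADD (a minimal sketch using Go's standard hash/fnv package; the item string is just an example):

package main

import (
    "fmt"
    "hash/fnv"
)

// hash64 maps a string to a uint64 using 64-bit FNV-1a.
func hash64(s string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(s)) // Write on an FNV hash never returns an error
    return h.Sum64()
}

func main() {
    // The resulting number is what I pass as the hash argument to CF.ADD.
    fmt.Println(hash64("example-item"))
}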

I am using the following setup:

redis_version: 5.0.5
programming language: golang
redis module used: go-redis
cuckoo filter version: libredis-cuckoofilter.so.1.1.0
OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)

Please let me know if you need more info.

Thanks,

kristoff-it commented 4 years ago

Hi shahasim!

Thanks for taking the time to write all the details. The error that you're getting from CF.CAPACITY looks like a bug in my code, I'll take a look as soon as I can.

In the meantime, please double check this:

CF.INIT test 4G

This command will create a 4G filter with 1-byte wide fingerprints. With 1-byte fingerprints there are only 256 possible fingerprint values, so at your scale collisions would be constant. If you used this command in your code, that would explain the collision problems you encountered.

To create a filter with 4-byte FPs:

CF.INIT test 4G 4

Please let me know if that was a bug in your code or if you just forgot to add the last parameter when writing this issue. I'll let you know about CF.CAPACITY once I take a look.

Ciao!

shahasim commented 4 years ago

Thanks for getting back to me on this.

I've checked the code; I am definitely adding the FPSize at the end of the init command, I just missed it when writing up the issue.

Just for my understanding: if I omit the FPSize at the end, will it still create a filter that holds 1 billion elements, with the only downside being more collisions?

However, my problem is that I am getting "ERR too full" after just 16,000 entries :(

I have also tried to create a filter of size 8G, but still no luck. On that note: the documentation says the size can be anything between 1K and 8G. Is 8G inclusive, i.e. can I create a filter of exactly 8G, or does it have to be less than 8G?

Thanks,

kristoff-it commented 4 years ago

So, I found the bug with CF.CAPACITY. It was in the README.

Long story short, the documentation was out of sync with what the code intended to do (probably an early draft that I forgot to update once I changed direction in the code).

The real signature for the command is the following:

CF.CAPACITY [size] [fpsize]

It's more or less the inverse of CF.SIZEFOR, and as such it doesn't involve any key. I've updated the documentation to reflect this.
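For example, to compute the capacity of the filter from your first message (an illustrative invocation; per the signature above, the reply is the number of items a filter with that size/fpsize combination can hold):

CF.CAPACITY 4G 4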

Just for my understanding: if I omit the FPSize at the end, will it still create a filter that holds 1 billion elements, with the only downside being more collisions?

A full explanation is a bit long (I tried to write it, then decided it's best not to throw a giant wall of text at you), but you can test manually what happens to the filter. Super long story short: with 4-byte fingerprints it takes 5 identical items to cause a "too full" error, and with all other settings it takes 9. Once you cause the first error you will still be able to add more items as long as the hash value is different, and it will take one fewer collision (so respectively 4 and 8) to cause another error. You can test this manually as follows:

cf.init test 1k 4
cf.add test 1 1
cf.add test 1 1
cf.add test 1 1
cf.add test 1 1
cf.add test 1 1
cf.add test 1 1 // returns error
cf.add test 2 1 
cf.add test 2 1
cf.add test 2 1
cf.add test 2 1
cf.add test 2 1 // returns error
cf.add test 3 1
...

So, going back to your comment: in Cuckoo filters, adding the same item twice is not a no-op like it is with Bloom filters. Cuckoo filters are organized in buckets that can become completely full. The filter is able to rearrange some fingerprints when that happens (hence the name Cuckoo), but that mechanism relies on the fingerprints not all being equal. The sequence above is basically an artificial way to fill a section of the filter without giving it a chance to "rebalance" the fingerprints. In real-life usage this should never happen, unless you're explicitly adding multiple copies of the same item to the filter, or going over the filter's capacity.
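If you're curious, here's a toy sketch of the mechanism in Go (since that's what you're using). The bucket size, table size, and index mixing are illustrative assumptions, not this library's actual code; the point is only to show why a bucket pair full of identical fingerprints can never be rebalanced:

package main

import "fmt"

const (
    numBuckets = 8 // power of two, so the XOR mixing below is reversible
    bucketSize = 4
    maxKicks   = 16
)

type filter struct {
    buckets [numBuckets][]uint16
}

// altBucket derives the alternate bucket index from the current one and the
// fingerprint (partial-key cuckoo hashing): applying it twice returns the
// original index.
func altBucket(i int, fp uint16) int {
    return (i ^ (int(fp) * 31)) & (numBuckets - 1)
}

func (f *filter) insert(i int, fp uint16) bool {
    for _, b := range []int{i, altBucket(i, fp)} {
        if len(f.buckets[b]) < bucketSize {
            f.buckets[b] = append(f.buckets[b], fp)
            return true
        }
    }
    // Both candidate buckets are full: evict a resident fingerprint and try
    // to relocate it to its own alternate bucket. When every resident
    // fingerprint equals fp, the evicted one's alternate bucket is the other
    // full bucket of the same pair, so the relocation just ping-pongs and
    // never frees a slot.
    cur := i
    for k := 0; k < maxKicks; k++ {
        evicted := f.buckets[cur][0]
        f.buckets[cur][0] = fp
        fp = evicted
        cur = altBucket(cur, fp)
        if len(f.buckets[cur]) < bucketSize {
            f.buckets[cur] = append(f.buckets[cur], fp)
            return true
        }
    }
    return false // the equivalent of "ERR too full"
}

func main() {
    var f filter
    for n := 1; ; n++ {
        if !f.insert(3, 0xAB) { // same bucket + same fingerprint every time
            fmt.Println("too full after", n-1, "successful inserts")
            return
        }
    }
}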

However, my problem is that I am getting "ERR too full" after just 16,000 entries :( I have also tried to create a filter of size 8G, but still no luck. On that note: the documentation says the size can be anything between 1K and 8G. Is 8G inclusive, i.e. can I create a filter of exactly 8G, or does it have to be less than 8G?

It's a bit embarrassing, but it seems you have stumbled upon two dumb mistakes on my part. After the outdated docs, you have now discovered that I uploaded an obsolete .so/.dylib as part of the latest release!

I tried to reproduce your problem with 8G filters by compiling the code on my machine, and I couldn't. Then I downloaded the latest release, and there was the problem.

I also suspect that the collision problem you're having is related to the fact that the release erroneously contains a build based on outdated code.

I just released v1.1.1, would you mind trying it out and seeing if you still have trouble with either 8G filters or collisions?

If you still have problems with collisions, please share your test routine so I can see what's going on.

Thanks!

shahasim commented 4 years ago

Thank you for the quick turnaround on this and the detailed explanation. It turns out I was also ignoring a key concept regarding cuckoo filters: I thought "ERR too full" could only mean I was running out of memory/capacity, but it was partly down to collisions and race conditions in my code :-(

I had chosen an FP size of 4 bytes to reduce false positives, but this also means the filter tolerates fewer identical-fingerprint collisions than the other FP sizes (5 instead of 9, as you explained).

I have downloaded the latest release (many thanks :-)) and changed my code to handle the race conditions. I will stress test my code over the next couple of days and keep you posted on the progress.

Thanks again :-)

kristoff-it commented 4 years ago

Sounds great.

If you want a suggestion for your testing suite:

  1. Compute random hash+fp pairs and use a Go map to ensure you don't have duplicates.
  2. Save the computed values to a file.
  3. In a separate script, read the values from file and then load them into Redis.

This way you split the slow part of the testing into a preprocessing stage that you don't need to re-run unless you want to generate different test data.
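A rough sketch of steps 1 and 2 (file name, pair count, and output format are placeholder assumptions):

package main

import (
    "bufio"
    "fmt"
    "math/rand"
    "os"
)

func main() {
    const n = 1000000 // adjust to your target load
    seen := make(map[[2]uint64]struct{}, n)

    f, err := os.Create("testdata.txt")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    w := bufio.NewWriter(f)
    defer w.Flush()

    // Generate random hash+fp pairs, using the map to guarantee uniqueness
    // so that duplicates can't artificially trigger "too full".
    for len(seen) < n {
        pair := [2]uint64{rand.Uint64(), rand.Uint64() & 0xFFFFFFFF} // hash + 4-byte fp
        if _, dup := seen[pair]; dup {
            continue
        }
        seen[pair] = struct{}{}
        fmt.Fprintf(w, "%d %d\n", pair[0], pair[1])
    }
}

The loading script then just reads the pairs back from the file and issues a CF.ADD for each one.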

Ciao!

kristoff-it commented 4 years ago

@shahasim any update?

shahasim commented 4 years ago

@kristoff-it, apologies for not responding to you earlier. Due to lack of time, we went with the https://github.com/RedisBloom/RedisBloom implementation of the cuckoo filter, mainly because of the CF.ADDNX command (add only if the item does not exist), which helped us reduce the engineering effort. It would be amazing if we could have something similar in your implementation.

kristoff-it commented 4 years ago

Hi @shahasim, no worries :)

Good that you found a way to move forward. I won't implement ADDNX for this library because I believe it's easy to misuse: ADDNX can't work (correctly) in conjunction with REMOVE.

For example, if you have A1 and A2, two items that collide, this sequence of operations will break the filter:

ADDNX A1
ADDNX A2 // won't be added because A1 is indistinguishable from A2
REMOVE A2 // will delete A1 because of the same reason as above
CHECK A1 // will return false, because we erroneously deleted it

This behavior is unavoidable given the way probabilistic data structures work. If you are fine with introducing false negatives into the filter, then you're absolutely welcome to do so, but I believe the average user prefers to be protected from such situations. Note how using ADD would have avoided this problem completely.

In fact, it's possible to emulate ADDNX with my library too: it's just a combination of CHECK + a conditional ADD, which can be implemented efficiently in a Lua script, but I believe it's a suitable behavior only for extremely rare use cases.
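For example, something along these lines (a sketch; I'm assuming CF.CHECK takes the same key/hash/fp arguments as CF.ADD and replies 1 when the item is probably present):

-- ADDNX emulation: KEYS[1] = filter key, ARGV[1] = hash, ARGV[2] = fp
if redis.call('CF.CHECK', KEYS[1], ARGV[1], ARGV[2]) == 1 then
    return 0 -- probably already present, skip the ADD
end
redis.call('CF.ADD', KEYS[1], ARGV[1], ARGV[2])
return 1

You would load it with SCRIPT LOAD and invoke it with EVALSHA, passing the filter key and the hash/fp pair. But again: combine this with REMOVE and you get the false-negative scenario above.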

shahasim commented 4 years ago

I agree with what you are saying about colliding items causing false negatives. But I still think the ADDNX functionality is good to have, with of course the caution and caveats that you mentioned. I still think your library provides a lot of flexibility, and ADDNX would add even more to it, but I'll leave that decision to you :)

I wish I had more time to implement it using Lua.