kristoff-it / redis-cuckoofilter

Hashing-function agnostic Cuckoo filters for Redis
MIT License
230 stars 22 forks source link

error ratio explanation #2

Closed migolovanov closed 5 years ago

migolovanov commented 5 years ago

Hello! Seems that i don't quite understand how error ratio works. I wrote simple script to add hashes with 4-byte fingerprint for 1-100000 intregers and then check for false-positives in range 100000-200000. In my case, when filter is full (intentionally choosed lower size, rather than suggested by cf.sizeof command) i get 3% false positives. If i undestand correctly, it should be somewhat around 0.0000001%. Seems that 1-byte fingerprint is used and i miss something, can you explain me what i did wrong? Thank you in advance. The code used for test is listed below:

import redis
import struct
import xxhash
import hashlib

def generate_hash(data):
    data=str(data)
    h=hashlib.new('sha1')
    h.update(str(i).encode("utf8"))
    sha1=h.hexdigest()
    digest=xxhash.xxh64(sha1).intdigest()
    xh=(digest & ((1 << 63) - 1)) - (digest & (1 << 63))
    fp=bytearray.fromhex(sha1[:8])
    fp=struct.unpack("<L", fp)[0]
    fp=(fp & ((1 << 31) - 1)) - (fp & (1 << 31))
    return (xh,fp)

client=redis.Redis()
client.execute_command("flushall")
client.execute_command("cf.init test 1K 4")
for i in range(0,100000):
    xhash,fp=generate_hash(i)
    try:
        client.execute_command("cf.add test {} {}".format(xhash,fp))
    except:
        break

result={"1":0,"0":0}
for i in range(100000,200000):
    xhash,fp=generate_hash(i)
    resp=client.execute_command("cf.check test {} {}".format(xhash,fp))
    result[resp.decode("utf8")]+=1
print(result["1"]/result["0"]*100)
kristoff-it commented 5 years ago

Hi, indeed you're right, I made a mistake while wiring the library to Redis. This commit fixed the error: https://github.com/kristoff-it/redis-cuckoofilter/commit/f8c7e3613ba47135b176bbcff428d9d85eb35aa4

I just released v1.1.0, which should solve all your problems, please try it out. I also made sure to thank you in the release notes.

I also added a last check to your script to see how many items do get recorded by the filter:

found = 0
for i in range(0,100000):
    xhash,fp=generate_hash(i)
    resp=client.execute_command("cf.check test {} {}".format(xhash,fp))
    found += int(resp.decode("utf8")) 
print("found", found)

On my test it returned 225 which is expected considering that cf.capacity 1k 4 returns 256.

Let me know if the issue can be considered resolved.

migolovanov commented 5 years ago

That's awesome! Thank you very much for such quick fix, now everything seems to work perfect.

By the way, i have one more offtopic question. Have you heard about Morton filter (cuckoo filter modification, which intends to work faster, consume less RAM and support of database self-resize feature)? By any chance, do you plan to add its implementation to redis too?

kristoff-it commented 5 years ago

I've heard about them but I have yet to really go through the paper and decide if I want to take the plunge or not... ...but I will probably do it sooner or later :)