kristoff-it / redis-cuckoofilter

Hashing-function agnostic Cuckoo filters for Redis
MIT License
230 stars 22 forks source link

Explanation for parameter `fp` #1

Closed anhldbk closed 6 years ago

anhldbk commented 6 years ago

Thanks for this awesome Redis module.

When reading to section add-an-item, still I can't understand what fp is for. Is it better if we just have to invoke the commands without fp ?

Would you please shed me some light?

kristoff-it commented 6 years ago

Hi!

fp is a required parameter and stands for fingerprint. Depending on the fpsize you chose when creating the filter, it will have to be 1, 2 or 4 bytes coming from the object you want to insert in the filter (in numerical form).

To give you a practical example:

Let's say that you created a filter with CF.INIT fruits 64K and want to insert into it the string "banana". First you have to compute its hash value (using a hashing function of your choice). Let's say that my hashing function produces 141298124 as hash value.

Now, for the fp, you have to choose a way of always extracting the same byte from the items you want to put into the filter. The most basic scheme is to always choose the first byte of the item.

So, for "banana", that would be 98 because the first byte is a 'b' and, using conventional encoding, that results in a numerical value of 98.

So, these would be some example commands that add fruit names into a fruit filter:

# banana
CF.ADD fruits 141298124 98 

# apple
CF.ADD fruits 471225849 97

# pear 
CF.ADD fruits -23894 112

Choosing the first byte is not always a good choice though. If you're keeping track of phone numbers, for example, it's best to use the last byte as it is the byte with the most variability, meaning that you get less collisions when adding millions of them.

When you're adding into the filter things that are more complex than strings, it's up to you to be smart about it. For example, if you're creating a filter for image files, you probably don't want the first byte nor the last because (generally speaking) they might have a header and a footer that depends on the format and actually has no image information at all. In that case the best choice would be to pick the data corresponding to a pixel in the image (and maybe one near the center, there are many ways to be clever about this as you can see).

In those cases, and when you want to use filters with fpsize greater than 1, you can't directly translate the byte value to an integer, but you have to use a library from your programming language of choice.

To give you an example in python:

fp_banana = ord('b') # 98
fp2_banana = ord('ba') # TypeError
# Requires python3:
fp_banana = int.from_bytes(b'b', byteorder='little') # 98
fp2_banana = int.from_bytes(b'ba', byteorder='little') # 24930
fp4_banana = int.from_bytes(b'bana', byteorder='little') # 1634623842

Let me know if this was helpful or if you need any other clarification!

anhldbk commented 6 years ago

@kristoff-it All things clear! Thank you for your nice explanation.