different behaviour on format version

be-a-panther commented 2 months ago

I used this bash script to evalute IPv4 bruteforce for different FPR. The dataset.txt are two different sets of IPv4 List. I changed it accordingly between the runs.

wc -l ../testdata*
         5 ../testdata.txt
 255102848 ../testdataRand.txt

selective information dump:

build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v1/filterR_0.001.pop 
File: v1/filterR_0.001.pop
        version                                  : 1
        capacity (desired number of elements)    : 255102848
        n (estimated number of elements)         : 255103777
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 437.2MB
build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v2/filterR_0.001.pop 
File: v2/filterR_0.001.pop
        version                                  : 2
        capacity (desired number of elements)    : 255102848
        n (estimated number of elements)         : 255041379
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 437.3MB
build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v1/filter_0.001.pop 
File: v1/filter_0.001.pop
        version                                  : 1
        capacity (desired number of elements)    : 5
        n (estimated number of elements)         : 4
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 16.0B
build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v2/filter_0.001.pop 
File: v2/filter_0.001.pop
        version                                  : 2
        capacity (desired number of elements)    : 5
        n (estimated number of elements)         : 5
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 4.0KB

bruteforce ipv4 script

``` #!/usr/bin/env bash version=1 id=R for p in 0.000000000001 0.00000000001 0.0000000001 0.000000001 0.00000001 0.0000001 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0; do echo "${p}" #echo "creating filter" # creating a filter with a desired capacity `-c` and false positive probability `-p` #/home/build/.cargo/bin/poppy -j 0 create --version ${version} -p ${p} v${version}/filter_${p}.pop testdataRand.txt #cat testdata_rand.txt | /home/build/.cargo/bin/poppy insert filter_${p}.pop # showing information about the filter we just created /home/build/.cargo/bin/poppy show v${version}/filter${id}_${p}.pop > v${version}/result${id}_${p}.txt echo "bruteforce ipv4" ./ipv4 | /home/build/.cargo/bin/poppy check v${version}/filter${id}_${p}.pop 2>/dev/null | wc -l >> v${version}/result${id}_${p}.txt done ```

But I got different results for using version 1 and version 2. I would say, the implementation for version 1 is somehow broken:

testdata Random

version 1: ``` tabs 30; for file in v1/resultR_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v1/resultR_0.000000000001.txt 2279827179 v1/resultR_0.00000000001.txt 2292496090 v1/resultR_0.0000000001.txt 2307695711 v1/resultR_0.000000001.txt 2279826672 v1/resultR_0.00000001.txt 2297276457 v1/resultR_0.0000001.txt 2319482144 v1/resultR_0.000001.txt 2279912424 v1/resultR_0.00001.txt 2307682256 v1/resultR_0.0001.txt 2348721215 v1/resultR_0.001.txt 2279909914 v1/resultR_0.01.txt 2348745809 v1/resultR_0.1.txt 2541550760 v1/resultR_1.0.txt 0 ``` version 2: ``` tabs 30; for file in v2/resultR_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v2/resultR_0.000000000001.txt 255104969 v2/resultR_0.00000000001.txt 255105149 v2/resultR_0.0000000001.txt 255105478 v2/resultR_0.000000001.txt 255105775 v2/resultR_0.00000001.txt 255106167 v2/resultR_0.0000001.txt 255106974 v2/resultR_0.000001.txt 255111509 v2/resultR_0.00001.txt 255150265 v2/resultR_0.0001.txt 255524983 v2/resultR_0.001.txt 259186885 v2/resultR_0.01.txt 295735913 v2/resultR_0.1.txt 669617261 v2/resultR_1.0.txt 4294967296 ```

testdata 5 IPv4

version 1: ``` tabs 30; for file in v1/result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v1/result_0.000000000001.txt 2169947840 v1/result_0.00000000001.txt 2090326774 v1/result_0.0000000001.txt 2300182147 v1/result_0.000000001.txt 2237356373 v1/result_0.00000001.txt 2226226152 v1/result_0.0000001.txt 2134588848 v1/result_0.000001.txt 2162538453 v1/result_0.00001.txt 2273858034 v1/result_0.0001.txt 2350928224 v1/result_0.001.txt 1996217247 v1/result_0.01.txt 2375923684 v1/result_0.1.txt 2614344677 v1/result_1.0.txt 0 ``` version 2: ``` tabs 30; for file in v2/result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v2/result_0.000000000001.txt 30 v2/result_0.00000000001.txt 30 v2/result_0.0000000001.txt 30 v2/result_0.000000001.txt 30 v2/result_0.00000001.txt 30 v2/result_0.0000001.txt 30 v2/result_0.000001.txt 30 v2/result_0.00001.txt 30 v2/result_0.0001.txt 30 v2/result_0.001.txt 30 v2/result_0.01.txt 30 v2/result_0.1.txt 30 v2/result_1.0.txt 712969 ```

filter size

``` stat -c "%s %n" v1/filter* 1833881816 v1/filterR_0.000000000001.pop 1681058336 v1/filterR_0.00000000001.pop 1528234856 v1/filterR_0.0000000001.pop 1375411376 v1/filterR_0.000000001.pop 1222587896 v1/filterR_0.00000001.pop 1069764416 v1/filterR_0.0000001.pop 916940936 v1/filterR_0.000001.pop 764117456 v1/filterR_0.00001.pop 611293976 v1/filterR_0.0001.pop 458470496 v1/filterR_0.001.pop 305647016 v1/filterR_0.01.pop 152823536 v1/filterR_0.1.pop 48 v1/filterR_1.0.pop 88 v1/filter_0.000000000001.pop 88 v1/filter_0.00000000001.pop 80 v1/filter_0.0000000001.pop 80 v1/filter_0.000000001.pop 72 v1/filter_0.00000001.pop 72 v1/filter_0.0000001.pop 72 v1/filter_0.000001.pop 64 v1/filter_0.00001.pop 64 v1/filter_0.0001.pop 64 v1/filter_0.001.pop 56 v1/filter_0.01.pop 56 v1/filter_0.1.pop 48 v1/filter_1.0.pop ``` I guess that the filter is correctly created, but I didn't check furthermore. ``` stat -c "%s %n" v2/filter* 1836384312 v2/filterR_0.000000000001.pop 1682612280 v2/filterR_0.00000000001.pop 1529872440 v2/filterR_0.0000000001.pop 1376682040 v2/filterR_0.000000001.pop 1223540792 v2/filterR_0.00000001.pop 1070596152 v2/filterR_0.0000001.pop 917385272 v2/filterR_0.000001.pop 764379192 v2/filterR_0.00001.pop 611414072 v2/filterR_0.0001.pop 458494008 v2/filterR_0.001.pop 305709112 v2/filterR_0.01.pop 152834104 v2/filterR_0.1.pop 4152 v2/filterR_1.0.pop 4152 v2/filter_0.000000000001.pop 4152 v2/filter_0.00000000001.pop 4152 v2/filter_0.0000000001.pop 4152 v2/filter_0.000000001.pop 4152 v2/filter_0.00000001.pop 4152 v2/filter_0.0000001.pop 4152 v2/filter_0.000001.pop 4152 v2/filter_0.00001.pop 4152 v2/filter_0.0001.pop 4152 v2/filter_0.001.pop 4152 v2/filter_0.01.pop 4152 v2/filter_0.1.pop 4152 v2/filter_1.0.pop ```

qjerome commented 2 months ago

Hi @be-a-panther,

Yes, version 1 is completely broken indeed ! You can find some details about it here: https://www.misp-project.org/2024/03/25/Poppy-a-new-bloom-filter-format-and-project.html/

Basically, version 1 must be used if and only if you need compatibility with format used by https://github.com/DCSO/bloom and related tools using the same format. It is there just for backward compatibility.

Are you experiencing any issue with version 2 ?

be-a-panther commented 2 months ago

Just for documenting purpose:

The main reason for version 1 is currently the use of:

I have currently no real world usage for version 2. I'm not sure if it is better to create a legacy version of poppy. You can keep the version 1 code base separate; I guess you will never touch it again

be-a-panther commented 2 months ago

Sorry for reopening this:

I run those against the golang implementation:

tabs 30; for file in result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8
result_0.000000000001.txt     5                                                                                                                                             
result_0.00000000001.txt      5
result_0.0000000001.txt       6
result_0.000000001.txt        12
result_0.00000001.txt         95
result_0.0000001.txt          247
result_0.000001.txt           4610
result_0.00001.txt            86730
result_0.0001.txt             930419
result_0.001.txt              2014243
result_0.01.txt               68138462
result_0.1.txt                589678745
result_1.0.txt                4294967296

this seems correct. The bloomfilter file was created by the rust implementation. I guess there is a bug within the check function. At the moment the bruteforce with the random IPv4 dataset is running, I will add the result here

qjerome commented 2 months ago

Can you please describe your issue clearly:

what do you expect ?
what do you get ?

Please also create a minimal reproducible example (leading to a problematic behavior) I can run on my side so that I can investigate ?

be-a-panther commented 2 months ago

the used IPv4 addresses

1.1.1.1
10.0.1.1
172.16.101.254
8.8.8.8
217.18.110.93

version 1: different check results I created a set of bloomfilter with only 5 entries, those are created with different FPR. Afterwards I bruteforce the whole IPv4 address space against thse bloomfilters.

I get different result for your implementation (poppy) and the dsco implementation (bloom). This should not happen, because I didn't recreate those bloomfilters. I just reuse those which are created by puppy. Those numbers are the positive matching counts for each bloom filter. If your implementation is compatible with version 1, those numbers should be the same.

qjerome commented 2 months ago

Indeed it seems there was a bug in the way value is checked against the filter. Could you please verify with your use case:

# this would install the fixed version of poppy in /tmp/bin
cargo install --git https://github.com/hashlookup/poppy.git --branch 'fix-version-1' --root /tmp/

qjerome commented 2 months ago

For posterity: the bug was introduced in this commit 12bbf8ffdf255a09cb876ada78722118785abb40 probably to solve in a too complex way the case where we query on an empty filter ...

be-a-panther commented 2 months ago

It seems to work, the bruteforce for 0.000000000001 shows just 5 entries. I will post the complete run after it is done.

qjerome commented 2 months ago

@be-a-panther any news ?

be-a-panther commented 2 months ago

Yeah, sorry for the delay. The results are the same as the golang implementation.

qjerome commented 2 months ago

Thank you @be-a-panther for reporting that issue and for your time testing.

hashlookup / poppy

different behaviour on format version #7