Closed be-a-panther closed 2 months ago
Hi @be-a-panther,
Yes, version 1 is completely broken indeed ! You can find some details about it here: https://www.misp-project.org/2024/03/25/Poppy-a-new-bloom-filter-format-and-project.html/
Basically, version 1 must be used if and only if you need compatibility with format used by https://github.com/DCSO/bloom and related tools using the same format. It is there just for backward compatibility.
Are you experiencing any issue with version 2 ?
Just for documenting purpose:
The main reason for version 1 is currently the use of:
I have currently no real world usage for version 2. I'm not sure if it is better to create a legacy version of poppy. You can keep the version 1 code base separate; I guess you will never touch it again
Sorry for reopening this:
I run those against the golang implementation:
tabs 30; for file in result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8
result_0.000000000001.txt 5
result_0.00000000001.txt 5
result_0.0000000001.txt 6
result_0.000000001.txt 12
result_0.00000001.txt 95
result_0.0000001.txt 247
result_0.000001.txt 4610
result_0.00001.txt 86730
result_0.0001.txt 930419
result_0.001.txt 2014243
result_0.01.txt 68138462
result_0.1.txt 589678745
result_1.0.txt 4294967296
this seems correct. The bloomfilter file was created by the rust implementation. I guess there is a bug within the check function. At the moment the bruteforce with the random IPv4 dataset is running, I will add the result here
Can you please describe your issue clearly:
what do you get ?
Please also create a minimal reproducible example (leading to a problematic behavior) I can run on my side so that I can investigate ?
1.1.1.1
10.0.1.1
172.16.101.254
8.8.8.8
217.18.110.93
check
results
I created a set of bloomfilter with only 5 entries, those are created with different FPR. Afterwards I bruteforce the whole IPv4 address space against thse bloomfilters.I get different result for your implementation (poppy) and the dsco implementation (bloom). This should not happen, because I didn't recreate those bloomfilters. I just reuse those which are created by puppy. Those numbers are the positive matching counts for each bloom filter. If your implementation is compatible with version 1, those numbers should be the same.
Indeed it seems there was a bug in the way value is checked against the filter. Could you please verify with your use case:
# this would install the fixed version of poppy in /tmp/bin
cargo install --git https://github.com/hashlookup/poppy.git --branch 'fix-version-1' --root /tmp/
For posterity: the bug was introduced in this commit 12bbf8ffdf255a09cb876ada78722118785abb40 probably to solve in a too complex way the case where we query on an empty filter ...
It seems to work, the bruteforce for 0.000000000001 shows just 5 entries. I will post the complete run after it is done.
@be-a-panther any news ?
Yeah, sorry for the delay. The results are the same as the golang implementation.
Thank you @be-a-panther for reporting that issue and for your time testing.
I used this bash script to evalute IPv4 bruteforce for different FPR. The
dataset.txt
are two different sets of IPv4 List. I changed it accordingly between the runs.selective information dump:
bruteforce ipv4 script
``` #!/usr/bin/env bash version=1 id=R for p in 0.000000000001 0.00000000001 0.0000000001 0.000000001 0.00000001 0.0000001 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0; do echo "${p}" #echo "creating filter" # creating a filter with a desired capacity `-c` and false positive probability `-p` #/home/build/.cargo/bin/poppy -j 0 create --version ${version} -p ${p} v${version}/filter_${p}.pop testdataRand.txt #cat testdata_rand.txt | /home/build/.cargo/bin/poppy insert filter_${p}.pop # showing information about the filter we just created /home/build/.cargo/bin/poppy show v${version}/filter${id}_${p}.pop > v${version}/result${id}_${p}.txt echo "bruteforce ipv4" ./ipv4 | /home/build/.cargo/bin/poppy check v${version}/filter${id}_${p}.pop 2>/dev/null | wc -l >> v${version}/result${id}_${p}.txt done ```But I got different results for using version 1 and version 2. I would say, the implementation for version 1 is somehow broken:
testdata Random
version 1: ``` tabs 30; for file in v1/resultR_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v1/resultR_0.000000000001.txt 2279827179 v1/resultR_0.00000000001.txt 2292496090 v1/resultR_0.0000000001.txt 2307695711 v1/resultR_0.000000001.txt 2279826672 v1/resultR_0.00000001.txt 2297276457 v1/resultR_0.0000001.txt 2319482144 v1/resultR_0.000001.txt 2279912424 v1/resultR_0.00001.txt 2307682256 v1/resultR_0.0001.txt 2348721215 v1/resultR_0.001.txt 2279909914 v1/resultR_0.01.txt 2348745809 v1/resultR_0.1.txt 2541550760 v1/resultR_1.0.txt 0 ``` version 2: ``` tabs 30; for file in v2/resultR_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v2/resultR_0.000000000001.txt 255104969 v2/resultR_0.00000000001.txt 255105149 v2/resultR_0.0000000001.txt 255105478 v2/resultR_0.000000001.txt 255105775 v2/resultR_0.00000001.txt 255106167 v2/resultR_0.0000001.txt 255106974 v2/resultR_0.000001.txt 255111509 v2/resultR_0.00001.txt 255150265 v2/resultR_0.0001.txt 255524983 v2/resultR_0.001.txt 259186885 v2/resultR_0.01.txt 295735913 v2/resultR_0.1.txt 669617261 v2/resultR_1.0.txt 4294967296 ```testdata 5 IPv4
version 1: ``` tabs 30; for file in v1/result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v1/result_0.000000000001.txt 2169947840 v1/result_0.00000000001.txt 2090326774 v1/result_0.0000000001.txt 2300182147 v1/result_0.000000001.txt 2237356373 v1/result_0.00000001.txt 2226226152 v1/result_0.0000001.txt 2134588848 v1/result_0.000001.txt 2162538453 v1/result_0.00001.txt 2273858034 v1/result_0.0001.txt 2350928224 v1/result_0.001.txt 1996217247 v1/result_0.01.txt 2375923684 v1/result_0.1.txt 2614344677 v1/result_1.0.txt 0 ``` version 2: ``` tabs 30; for file in v2/result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8 v2/result_0.000000000001.txt 30 v2/result_0.00000000001.txt 30 v2/result_0.0000000001.txt 30 v2/result_0.000000001.txt 30 v2/result_0.00000001.txt 30 v2/result_0.0000001.txt 30 v2/result_0.000001.txt 30 v2/result_0.00001.txt 30 v2/result_0.0001.txt 30 v2/result_0.001.txt 30 v2/result_0.01.txt 30 v2/result_0.1.txt 30 v2/result_1.0.txt 712969 ```filter size
``` stat -c "%s %n" v1/filter* 1833881816 v1/filterR_0.000000000001.pop 1681058336 v1/filterR_0.00000000001.pop 1528234856 v1/filterR_0.0000000001.pop 1375411376 v1/filterR_0.000000001.pop 1222587896 v1/filterR_0.00000001.pop 1069764416 v1/filterR_0.0000001.pop 916940936 v1/filterR_0.000001.pop 764117456 v1/filterR_0.00001.pop 611293976 v1/filterR_0.0001.pop 458470496 v1/filterR_0.001.pop 305647016 v1/filterR_0.01.pop 152823536 v1/filterR_0.1.pop 48 v1/filterR_1.0.pop 88 v1/filter_0.000000000001.pop 88 v1/filter_0.00000000001.pop 80 v1/filter_0.0000000001.pop 80 v1/filter_0.000000001.pop 72 v1/filter_0.00000001.pop 72 v1/filter_0.0000001.pop 72 v1/filter_0.000001.pop 64 v1/filter_0.00001.pop 64 v1/filter_0.0001.pop 64 v1/filter_0.001.pop 56 v1/filter_0.01.pop 56 v1/filter_0.1.pop 48 v1/filter_1.0.pop ``` I guess that the filter is correctly created, but I didn't check furthermore. ``` stat -c "%s %n" v2/filter* 1836384312 v2/filterR_0.000000000001.pop 1682612280 v2/filterR_0.00000000001.pop 1529872440 v2/filterR_0.0000000001.pop 1376682040 v2/filterR_0.000000001.pop 1223540792 v2/filterR_0.00000001.pop 1070596152 v2/filterR_0.0000001.pop 917385272 v2/filterR_0.000001.pop 764379192 v2/filterR_0.00001.pop 611414072 v2/filterR_0.0001.pop 458494008 v2/filterR_0.001.pop 305709112 v2/filterR_0.01.pop 152834104 v2/filterR_0.1.pop 4152 v2/filterR_1.0.pop 4152 v2/filter_0.000000000001.pop 4152 v2/filter_0.00000000001.pop 4152 v2/filter_0.0000000001.pop 4152 v2/filter_0.000000001.pop 4152 v2/filter_0.00000001.pop 4152 v2/filter_0.0000001.pop 4152 v2/filter_0.000001.pop 4152 v2/filter_0.00001.pop 4152 v2/filter_0.0001.pop 4152 v2/filter_0.001.pop 4152 v2/filter_0.01.pop 4152 v2/filter_0.1.pop 4152 v2/filter_1.0.pop ```