Open longlong354 opened 1 month ago
This appears to have been a deliberate change back in 2012. Maybe @kluever remembers the rationale.
Thanks for the reply. My fault, I didn't express the main point clearly enough. The main recommendation is to use
optimalNumOfHashFunctions(double p)
instead of
optimalNumOfHashFunctions(long n, long m)
Hey, @longlong354
So you just want optimalNumOfHashFunctions to take (double p) rather than (long n, long m) as its argument in BloomFilter?
The situation now is roughly that we start from $n$, the expected number of entries, and $p$, the desired false positive probability, and we derive $m$, the optimal number of bits, as
$$ m = \lfloor {-n\ \ln\ p } / { (\ln\ 2)^2} \rfloor $$
Then we further derive $k$, the optimal number of hash functions, as
$$ k = (m / n)\ \ln\ 2 \approx {-n\ \ln\ p } / { \ln\ 2} $$
rounded to an integer. You are proposing instead
$$ k = { -\ln\ p } / { \ln\ 2 } $$
removing the factor of $n$. That's a different number. Are you saying that it's more accurate? Could you explain why?
The derivation of the formula
k = (m / n) ln 2 ≈ -n ln p / ln 2
is inaccurate; it should be
k = (m / n) ln 2 ≈ - ln p / ln 2
Please note that the $n$ in the denominator of $m / n$ cancels the factor of $n$ in $m = \lfloor -n\ \ln\ p / (\ln\ 2)^2 \rfloor$, so $k$ does not depend on $n$.
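Spelled out as a sketch (ignoring the floor in $m$ for simplicity, which is an approximation), the cancellation reads:

$$ k = \frac{m}{n}\ \ln\ 2 \approx \frac{1}{n} \cdot \frac{-n\ \ln\ p}{(\ln\ 2)^2} \cdot \ln\ 2 = \frac{-\ln\ p}{\ln\ 2} = -\log_2 p $$

For example, $p = 0.03$ gives $k \approx 5.06$, which rounds to 5 for any $n$.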
PS: See https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions. It can also be verified at https://krisives.github.io/bloom-calculator/ that n does not affect k.
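The same point can be checked numerically. The following is a minimal sketch, not Guava's code: the class name `HashCountCheck` and the method names `viaBits` and `direct` are hypothetical, and the formulas are the ones quoted in this thread. It computes $k$ once through the two-step route (derive $m$ from $n$ and $p$, then $k$ from $m / n$) and once directly from $p$.

```java
final class HashCountCheck {
    // Two-step route: m = floor(-n ln p / (ln 2)^2), then k = round((m / n) ln 2).
    static int viaBits(long n, double p) {
        long m = (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        // Cast to double before dividing to avoid integer truncation of m / n.
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Direct route: k = round(-ln p / ln 2), with no dependence on n.
    static int direct(double p) {
        return Math.max(1, (int) Math.round(-Math.log(p) / Math.log(2)));
    }
}
```

With p = 0.03, both routes give 5 for n = 100 and for n = 1,000,000. Because $m$ is floored, the two-step route can land on the other side of a rounding boundary for very small n (for instance n = 1 with p = 0.01), so the two are equal only up to that rounding effect.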
> Hey, @longlong354
> So you just want optimalNumOfHashFunctions to take (double p) rather than (long n, long m) as its argument in BloomFilter?
yep~
API(s)
How do you want it to be improved?
1. Use statically calculated values of log(2) and squared log(2):
2. Calculate optimalNumOfBits using the static values:
3. Calculate optimalNumOfHashFunctions from the false positive rate (p) directly, using LOG_TWO:
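The code that originally accompanied these three steps is not reproduced here, so the following is only a sketch of what they might look like. The class name `BloomSizingSketch` and the constant name `SQUARED_LOG_TWO` are hypothetical; `LOG_TWO` and the two method names come from the steps above, and the guard against `p == 0` is an assumption about how the edge case would be handled.

```java
final class BloomSizingSketch {
    // Step 1: ln 2 and (ln 2)^2, computed once rather than on every call.
    static final double LOG_TWO = Math.log(2);
    static final double SQUARED_LOG_TWO = LOG_TWO * LOG_TWO; // hypothetical name

    // Step 2: m = floor(-n ln p / (ln 2)^2), using the precomputed constant.
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE; // assumption: avoid ln 0 = -infinity
        }
        return (long) (-n * Math.log(p) / SQUARED_LOG_TWO);
    }

    // Step 3: k = round(-ln p / ln 2), at least 1, with no n or m involved.
    static int optimalNumOfHashFunctions(double p) {
        return Math.max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
    }
}
```

For instance, with n = 1000 and p = 0.01 this sketch yields 9585 bits and 7 hash functions.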
Why do we need it to be improved?
Example
Current Behavior
As in the source code under "How do you want it to be improved?"
Desired Behavior
As in the given code in com.google.common.hash.BloomFilter::optimalNumOfBits(long n, double p) and com.google.common.hash.BloomFilter::optimalNumOfHashFunctions(long n, long m)
Concrete Use Cases
Same as "How do you want it to be improved?"
Checklist
[X] I agree to follow the code of conduct.
[X] I have read and understood the contribution guidelines.
[X] I have read and understood Guava's philosophy, and I strongly believe that this proposal aligns with it.