jnievelt opened this issue 8 years ago
I don't know why this `require` is there, but I wrote it. :/
What is happening is that we have some range `[a, b]`, and with probability `>= p` we are inside that range. Now we say we know the value is at least as large as `m`, but `m > b`, so clearly the event that the true value lies in `[a, b]` didn't happen, and we only know something like: `[m, m]` is the new range with probability `> 0`.
The HLL intersection algorithm is correct, I think; it is just that the noise is so large we don't know what the true value is.
I think we want to fix `.withMin` to just return `(m, m, m, 0.0)` in this case, which hopefully consumers see as very uninformed.
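A minimal Python sketch of that idea (Algebird's `Approximate` is Scala; the field names and the untightened-probability handling here are assumptions for illustration only):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Approximate:
    """Toy model of an approximate count: the true value is in
    [min, max] with probability >= prob; estimate is the point guess."""
    min: float
    estimate: float
    max: float
    prob: float

    def with_min(self, m: float) -> "Approximate":
        if m <= self.min:
            return self  # no new information
        if m <= self.max:
            # Tighten the lower bound (a simplification; the real
            # conditioning on the probability could be more careful).
            return Approximate(m, max(self.estimate, m), self.max, self.prob)
        # m > max: the event "true value in [min, max]" did not happen.
        # Instead of raising (as the current require does), collapse to
        # the proposed "very uninformed" value.
        return Approximate(m, m, m, 0.0)

# The problematic case: lower-bounding by something above the upper bound.
a = Approximate(10.0, 50.0, 100.0, 0.9)
print(a.with_min(150.0))
# Approximate(min=150.0, estimate=150.0, max=150.0, prob=0.0)
```

With `prob = 0.0`, any consumer checking the containment probability sees the result carries no real information, rather than getting an exception.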
Any other ideas?
I suspect this is a sort of edge case in the HLL approximation algorithm, but it seems there should be a better way to surface it than this exception.
I've reconstructed the case as reported, coming up with two sets with sizes 428 and 395 and intersection size 67, using the default hasher and 9 bits:
The size approximations illustrate why our intersection algorithm doesn't work: the min value for the union is larger than the sum of max values for the individual HLLs:
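For concreteness, here is a sketch of the inclusion-exclusion arithmetic involved (the intervals below are hypothetical numbers in the spirit of the reported case, not the actual reconstructed values). The intersection estimate is |A| + |B| - |A ∪ B|, so its upper bound is max(A) + max(B) - min(A ∪ B); once the union's min exceeds the sum of the individual maxes, even that upper bound goes negative, and the subsequent `withMin(0)` is exactly the failing `require`:

```python
def intersect_bounds(a, b, union):
    """Interval arithmetic for |A ∩ B| = |A| + |B| - |A ∪ B|.
    Each argument is a (min, estimate, max) triple."""
    lo = a[0] + b[0] - union[2]
    est = a[1] + b[1] - union[1]
    hi = a[2] + b[2] - union[0]
    return (lo, est, hi)

# Hypothetical HLL-style cardinality intervals (NOT the report's output):
# a union interval whose min exceeds the sum of the two individual maxes.
a = (380.0, 428.0, 470.0)
b = (350.0, 395.0, 440.0)
union = (920.0, 950.0, 1000.0)   # union min 920 > 470 + 440 = 910

lo, est, hi = intersect_bounds(a, b, union)
print(lo, est, hi)  # -270.0 -127.0 -10.0
# Even the upper bound is negative, so clamping with withMin(0) asks for
# a lower bound above the upper bound -- the require fails.
```

This is why collapsing to an uninformed `(m, m, m, 0.0)` in `withMin` seems preferable to throwing: the arithmetic itself is fine, but the noise has pushed the interval somewhere no true value can be.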