Open Apsod opened 6 months ago
While the resulting numbers are not the same, this does not seem like conclusive proof to me that you get more collisions. You will simply get them on different values, since in the end the modulo squashes everything into the same value range in both cases. The pure Python implementation also has the (very big) disadvantage of being considerably slower.
I agree that this is not conclusive proof of more collisions. However, it seems like a bug to me to purportedly do affine transforms modulo Mersenne primes when this is not what the code is doing. Currently, the implementation is doing the following:
def h3(a, b, shingles):
    # Native python, simulating overflow, equals to h1.
    rows = []
    for sj in shingles[:, 0].tolist():
        rows.append([
            ((sj * ai + bi) % (1<<64)) % _mersenne_prime
            for ai, bi in zip(a[0].tolist(), b[0].tolist())
        ])
    return np.array(rows)
At which point I doubt the whole Mersenne prime field serves any purpose: you could just as well go mod (1<<64), i.e. rely on the overflow alone, with no explicit mod (or Mersenne primes) at all.
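To make the difference concrete, here is a minimal sketch in plain Python integers (no numpy) comparing the exact affine transform with the wrapped one. The modulus `(1 << 61) - 1`, the helper names, and the sample values of `a`, `b` and `x` are my own assumptions for illustration, not taken from either repo.

```python
# Assumed modulus for illustration: the Mersenne prime 2**61 - 1.
MERSENNE_PRIME = (1 << 61) - 1
UINT64_MASK = (1 << 64) - 1


def affine_exact(a, b, x):
    # True affine transform in the prime field: no intermediate overflow,
    # because Python integers have arbitrary precision.
    return (a * x + b) % MERSENNE_PRIME


def affine_wrapped(a, b, x):
    # Same transform, but the intermediate result is first truncated to
    # 64 bits, which is what unsigned 64-bit arithmetic would do.
    return ((a * x + b) & UINT64_MASK) % MERSENNE_PRIME


# Hypothetical inputs chosen so that a * x exceeds 64 bits (about 2**70).
a, b, x = (1 << 40) + 7, 12345, (1 << 30) + 3

exact = affine_exact(a, b, x)
wrapped = affine_wrapped(a, b, x)
print(exact, wrapped, exact != wrapped)
```

For small inputs that never exceed 64 bits the two functions agree, so the discrepancy only shows up once the product overflows. This does not by itself say anything about collision rates; it only shows that the wrapped version is not the affine map over the prime field that it appears to be.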
Seeing as this repo inherits lots of code from https://github.com/ekzhu/datasketch, it should be noted that the implementation of Mersenne prime hashing used in both repos causes overflows, and potentially more hash collisions than intended: