jonnyarnold opened this issue 9 years ago
After making this, I realised it already existed many times over on rubygems. Oh well.
But there will be a lot of collisions. When they say 4 bytes, it doesn't mean the hashes will be unique every time; they are using XOR to compress the digest down to 4 bytes.
What would you rather use? I don't really know enough about hashing to know how to go about improving this.
Probably a standard hashing algorithm, like SHA-1 or maybe even MD5.
That would produce really long human hashes :D ... Do collisions really matter in this case?
Probably yes, if we're talking about millions of records. Anyway, what is the biggest feed we're going to process? A million? 10 million? If so, I think Ruby can handle that with a little bit of extra RAM.
It's a performance issue: the lookups against a 100,000-entry Hash add up. With a hashing algorithm we don't have to look anything up; we can just hash each address and move on.
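For reference, a minimal sketch of that stateless approach, using SHA-1 from Ruby's standard Digest library as suggested above; the method name and pseudonym format are made up for illustration:

```ruby
require 'digest'

# Hypothetical helper: derive a stable pseudonym from an address with
# no per-address state. The same input always yields the same output,
# so no lookup table is needed.
def anonymise(address)
  digest = Digest::SHA1.hexdigest(address.downcase)
  "user-#{digest[0, 12]}@example.com"
end

anonymise('alice@example.org') # always returns the same pseudonym for this input
```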
If we can get the claimed collision rate of 1 in 4.3 billion, that should be sufficient for the sets of files we want to anonymise. @Antti - if you think the collision rate is higher, can you estimate it?
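As a rough way of estimating that, the birthday approximation below (a sketch, not something from the thread) treats the 1-in-4.3-billion figure as a 2^32 output space and asks how likely at least one collision is across a whole feed:

```ruby
# Birthday-problem approximation: probability of at least one collision
# when hashing n items into a space of d possible values.
def collision_probability(n, d)
  1 - Math.exp(-n.to_f * (n - 1) / (2 * d))
end

space = 2**32 # "1 in 4.3 billion" corresponds to a 4-byte (2^32) output
[100_000, 1_000_000].each do |n|
  printf("%9d addresses -> %.1f%% chance of at least one collision\n",
         n, collision_probability(n, space) * 100)
end
# 100,000 addresses already give roughly a two-in-three chance of a collision
# in a 32-bit space, so the per-pair rate is not the whole story.
```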
Currently we build a lookup table as we parse e-mail addresses, so that the same input always maps to the same anonymised address. This doesn't scale when we get to 100,000+ addresses.
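Roughly, the current stateful approach looks like the sketch below; the class and naming are hypothetical, not the actual code:

```ruby
# Hypothetical sketch of the current approach: a Hash maps each original
# address to its replacement, so repeated addresses anonymise consistently.
# The table (and memory use) grows with the number of distinct addresses.
class LookupAnonymiser
  def initialize
    @table = {}
    @counter = 0
  end

  def anonymise(address)
    @table[address] ||= "user#{@counter += 1}@example.com"
  end
end
```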
An alternative would be to hash our addresses, but we want to keep the e-mail addresses human-readable. Fortunately, there is a hashing algorithm around that returns human-readable output. As a bonus, the hashes it returns are ridiculous.
We should look at porting this to Ruby and using it.
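A rough sketch of what a Ruby port could look like, assuming the scheme described above (hash the input, XOR-fold the digest down to a few bytes, map each byte to a word). The word list here is a placeholder; a full port would want a 256-entry list so every byte value maps to a distinct word:

```ruby
require 'digest'

# Hypothetical humanhash-style helper: XOR-fold an MD5 digest down to
# `words` bytes, then map each byte to a word.
WORDS = %w[alpha bravo charlie delta echo foxtrot golf hotel].freeze # placeholder list

def human_hash(input, words: 4)
  bytes = Digest::MD5.digest(input).bytes            # 16 raw digest bytes
  bytes.each_slice(bytes.length / words)
       .map { |slice| slice.reduce(:^) }             # XOR-fold each slice to one byte
       .map { |b| WORDS[b % WORDS.length] }
       .join('-')
end

human_hash('alice@example.org') # => e.g. "echo-bravo-hotel-delta"
```

With a 256-word list, 4 words span the same 2^32 space discussed above, so the collision question raised earlier applies to this port unchanged.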