jonnyarnold opened this issue 9 years ago
After making this, I realised it already existed many times over on rubygems. Oh well.
But there will be a lot of collisions. When they say 4 bytes, it doesn't mean the hashes will be unique every time; they are using XOR to compress the digest down to 4 bytes.
What would you rather use? I don't really know enough about hashing to know how to go about improving this.
Probably a standard hashing algorithm, like SHA-1 or maybe even MD5.
That would produce really long human hashes :D ... Do collisions really matter in this case?
Probably yes, if we're talking about millions of records. Anyway, what is the biggest feed we're going to process? A million? 10 million? If so, I think Ruby can handle that with a little bit of extra RAM.
It's a performance issue: the lookups against a 100,000-entry Hash add up. With a hashing algorithm we don't have to look anything up; we can just hash each address and move on.
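For reference, a minimal sketch of that stateless approach, using SHA-1 from Ruby's standard Digest library as suggested above; the method name and pseudonym format are made up for illustration:

```ruby
require 'digest'

# Hypothetical helper: derive a stable pseudonym from an address with
# no per-address state. The same input always yields the same output,
# so no lookup table is needed.
def anonymise(address)
  digest = Digest::SHA1.hexdigest(address.downcase)
  "user-#{digest[0, 12]}@example.com"
end

anonymise('alice@example.org') # always returns the same pseudonym for this input
```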
If we can get the claimed collision rate of 1 in 4.3 billion, that should be sufficient for the sets of files we want to anonymise. @Antti - if you think the collision rate is higher, can you estimate it?
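As a rough way of estimating that, the birthday approximation below (a sketch, not something from the thread) treats the 1-in-4.3-billion figure as a 2^32 output space and asks how likely at least one collision is across a whole feed:

```ruby
# Birthday-problem approximation: probability of at least one collision
# when hashing n items into a space of d possible values.
def collision_probability(n, d)
  1 - Math.exp(-n.to_f * (n - 1) / (2 * d))
end

space = 2**32 # "1 in 4.3 billion" corresponds to a 4-byte (2^32) output
[100_000, 1_000_000].each do |n|
  printf("%9d addresses -> %.1f%% chance of at least one collision\n",
         n, collision_probability(n, space) * 100)
end
# 100,000 addresses already give roughly a two-in-three chance of a collision
# in a 32-bit space, so the per-pair rate is not the whole story.
```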
Currently we build a lookup table as we parse e-mail addresses, so that the same input always maps to the same anonymised address. This doesn't scale when we get to 100,000+ addresses.
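Roughly, the current stateful approach looks like the sketch below; the class and naming are hypothetical, not the actual code:

```ruby
# Hypothetical sketch of the current approach: a Hash maps each original
# address to its replacement, so repeated addresses anonymise consistently.
# The table (and memory use) grows with the number of distinct addresses.
class LookupAnonymiser
  def initialize
    @table = {}
    @counter = 0
  end

  def anonymise(address)
    @table[address] ||= "user#{@counter += 1}@example.com"
  end
end
```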
An alternative would be to hash our addresses, but we want to keep the e-mail addresses human-readable. Fortunately, there is a hashing algorithm around that returns human-readable output. As a bonus, the hashes it returns are ridiculous.
We should look at porting this to Ruby and using it.
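A rough sketch of what a Ruby port could look like, assuming the scheme described above (hash the input, XOR-fold the digest down to a few bytes, map each byte to a word). The word list here is a placeholder; a full port would want a 256-entry list so every byte value maps to a distinct word:

```ruby
require 'digest'

# Hypothetical humanhash-style helper: XOR-fold an MD5 digest down to
# `words` bytes, then map each byte to a word.
WORDS = %w[alpha bravo charlie delta echo foxtrot golf hotel].freeze # placeholder list

def human_hash(input, words: 4)
  bytes = Digest::MD5.digest(input).bytes            # 16 raw digest bytes
  bytes.each_slice(bytes.length / words)
       .map { |slice| slice.reduce(:^) }             # XOR-fold each slice to one byte
       .map { |b| WORDS[b % WORDS.length] }
       .join('-')
end

human_hash('alice@example.org') # => e.g. "echo-bravo-hotel-delta"
```

With a 256-word list, 4 words span the same 2^32 space discussed above, so the collision question raised earlier applies to this port unchanged.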