praekeltfoundation / vumi-wikipedia

Vumi Wikipedia Application
BSD 3-Clause "New" or "Revised" License
8 stars 8 forks source link

Logs should hash any unique user identifier #24

Closed smn closed 11 years ago

smn commented 11 years ago

For log parsing the actual users MSISDN is not needed, it needs to be hashed before being printed.

@jerith have you got thoughts on this?

jerith commented 11 years ago

It's useful to have the real identifier available for debugging and such, but those can be different logs from the ones we're emitting to be parsed.

Actually anonymising the identifier is tricky, since it's small enough that any hash can be brute-forced. What kind of correlation ability are we after? It's pretty straightforward to generate a random identifier and stash it in the session but that won't let us connect different sessions (which has been useful for tracking down bugs in the past).

smn commented 11 years ago

We'll be able to lookup the real identifier anyway if we log the message_id along with the hashed msisdn.

We just need a unique identifier that allows people analyzing the logs to identify someone without needing their msisdn. I'm kinda -1 on sticking it in the session as those are generally assumed to expire.

Why not something like (borrowing a bit from Django's password hashing):

pbkdf2(from_addr, salt, iterations)

The problem though is that iterations is recommended to be 10K, which can take up to 100ms.

smn commented 11 years ago

Before I push further on this, @jerith and @hodgestar -- thoughts on the approach?

jerith commented 11 years ago

I don't know enough about this stuff to know if it's a good solution, but it's definitely better than nothing.

hodgestar commented 11 years ago

I think the scheme @smn implemented might be good enough if we declare that the "salt" is really an additional secret to be protected. Then the security becomes a lot better because the attacker has to figure out the salt too (and many different phone numbers might match the same MSISDN with different values of the "salt"). Maybe we should rename the hash_salt to secret_key though?

smn commented 11 years ago

The feedback is that the approach taken is fine to go with a sha256 but then chop the hash in half before logging to make it impossible to reverse. They're willing to deal with the small chance of collisions in the keyspace as a result.

Will go ahead and implement this and write a test.

smn commented 11 years ago

@jerith @hodgestar ready for review.

hodgestar commented 11 years ago

:+1: