project-callisto / callisto-core

Report intake, escrow, matching and secure delivery code for Callisto, an online reporting system for sexual assault.
https://www.projectcallisto.org
GNU Affero General Public License v3.0
137 stars 54 forks source link

Better anonymization of evaluation data #100

Open divergentdave opened 8 years ago

divergentdave commented 8 years ago

Currently, the encrypted data in an EvalRow includes SHA-256 hashes of the perpetrator identifier, user identifier, and record identifier. IIUC, the user identifier and record identifier are integers from the database. These should be anonymized using a more expensive operation to slow down brute-force searches should the decrypted evaulation data ever be revealed.

Moreover, in the (unlikely) event that there is a perpetrator identifier longer than 64 bytes, the SHA-256 hash of the identifier stored in the EvalRow would be enough key material to decrypt the corresponding match report data. Due to the construction of HMAC-SHA256 inside of PBKDF2-SHA256, any key longer than 64 bytes is replaced with its own SHA-256 hash before anything else happens.

I'd recommend anonymizing these three fields using PBKDF2, after appending distinguishing prefixes for each field, or using distinguishing constants for the salts. To ease migration, we could take a page out of the Django documentation and anonymize the identifiers with something like this:

identifier = pbkdf2(hashlib.sha256(str(id).encode()).hexdigest(), 'which identifier', 100000, digest=hashlib.sha256)

Of course, all of the above is mitigated by GPG encrypting the whole EvalRow. Nevertheless, I think it would be prudent to use a more expensive function here.

kelseyq commented 8 years ago

This all sounds good to me. I do want to point out that the record and user identifiers are extremely guessable (just autoincremented ints), so I'm not convinced that increased iterations are going to give us much in terms of preventing brute force correlation of those. Definitely more important for the identifier hash.

We also need to document SHI's protocols for safeguarding the anonymous data (described briefly here) for potential implementers.