Open divergentdave opened 8 years ago
This all sounds good to me. I do want to point out that the record and user identifiers are extremely guessable (just autoincremented ints), so I'm not convinced that increased iterations are going to give us much in terms of preventing brute force correlation of those. Definitely more important for the identifier hash.
We also need to document SHI's protocols for safeguarding the anonymous data (described briefly here) for potential implementers.
Currently, the encrypted data in an EvalRow includes SHA-256 hashes of the perpetrator identifier, user identifier, and record identifier. IIUC, the user identifier and record identifier are integers from the database. These should be anonymized using a more expensive operation to slow down brute-force searches should the decrypted evaulation data ever be revealed.
Moreover, in the (unlikely) event that there is a perpetrator identifier longer than 64 bytes, the SHA-256 hash of the identifier stored in the EvalRow would be enough key material to decrypt the corresponding match report data. Due to the construction of HMAC-SHA256 inside of PBKDF2-SHA256, any key longer than 64 bytes is replaced with its own SHA-256 hash before anything else happens.
I'd recommend anonymizing these three fields using PBKDF2, after appending distinguishing prefixes for each field, or using distinguishing constants for the salts. To ease migration, we could take a page out of the Django documentation and anonymize the identifiers with something like this:
Of course, all of the above is mitigated by GPG encrypting the whole EvalRow. Nevertheless, I think it would be prudent to use a more expensive function here.