mercure-imaging / mercure-anonymizer

DICOM anonymization module for mercure
https://mercure-imaging.org
GNU General Public License v3.0

Add option for using a hashed Study UID / Series UID instead of newly generated UID #2

Open tblock79 opened 11 months ago

chrstphmr commented 10 months ago

Couple of thoughts on this:

What attributes to hash

UIDs: It'd be nice to have a global option to hash all UIDs that have not been explicitly set to 'keep' either in a preset or in the settings. This way, the various InstanceUIDs and, importantly, also the cross-references that point to an instance (ReferencedXXXUID) should be preserved in a relatively future-proof manner.
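To illustrate why a deterministic hash keeps cross-references intact, here is a minimal sketch. It does not use the module's real data structures: `Element`, `hash_uids`, and the `keep` set are hypothetical stand-ins, and the UID mapping borrows the 128-bit truncation idea discussed under "Mapping the digest" below.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

# Hypothetical stand-in for a DICOM element: (keyword, VR, value).
Element = Tuple[str, str, str]

def hash_uids(elements: Iterable[Element], keep: Set[str], salt: bytes) -> List[Element]:
    """Hash every UI-valued element unless its keyword is listed in `keep`."""
    out = []
    for keyword, vr, value in elements:
        if vr == "UI" and keyword not in keep:
            digest = hashlib.sha256(salt + value.encode("ascii")).digest()
            # Assumption: truncate to 128 bits and use the 2.25. root.
            value = f"2.25.{int.from_bytes(digest[:16], 'big')}"
        out.append((keyword, vr, value))
    return out

elements = [
    ("SOPClassUID", "UI", "1.2.840.10008.5.1.4.1.1.2"),
    ("SOPInstanceUID", "UI", "1.2.3.4"),
    ("ReferencedSOPInstanceUID", "UI", "1.2.3.4"),
    ("PatientID", "LO", "12345"),
]
result = hash_uids(elements, keep={"SOPClassUID"}, salt=b"example-salt")
```

Because the same input always yields the same output, `SOPInstanceUID` and `ReferencedSOPInstanceUID` still match after hashing, while the kept `SOPClassUID` and the non-UI `PatientID` pass through unchanged.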

Beyond UIDs, the hash function should also be applicable to other tags. A few that come to mind:

(0008,0050) SH Accession Number
(0010,0020) LO Patient ID
(0010,0010) PN Patient's Name

How to hash

Compute a cryptographic hash of the original value of the DICOM attribute together with a secret salt. The salt protects against pre-image attacks/rainbow tables.

The salt would need to be passed to the anonymizer as a module setting to ensure that the same input value results in the same output value. This way, multiple Mercure instances in one institution can also create reproducible hash values. The module should fail if it's configured to hash but no salt has been passed to it. I guess it's fair to leave the generation of the salt to the Mercure admin, maybe with some guidance in the documentation explaining how to obtain a random salt.

There are multiple suitable hash functions of course, e.g. SHA-256. Alternatively, a key derivation function such as argon2id might be considered, but that is probably overkill.
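A minimal sketch of the salted hash, assuming SHA-256 over the salt concatenated with the original value. The function name `salted_digest` and the way the salt is passed in are hypothetical, not the module's actual interface.

```python
import hashlib
from typing import Optional

def salted_digest(value: str, salt: Optional[bytes]) -> bytes:
    """Return SHA-256(salt + value); fail loudly if no salt is configured."""
    if not salt:
        raise ValueError("hashing is enabled but no salt was configured")
    return hashlib.sha256(salt + value.encode("utf-8")).digest()

# Same value and same salt always give the same digest.
digest = salted_digest("1.2.840.113619.2.55", b"example-salt")
```

For the documentation, one way an admin could obtain a random salt is `python -c "import secrets; print(secrets.token_hex(32))"`.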

Mapping the digest

The digest of the hash function needs to be mapped to a value that is valid for the DICOM value representation of the respective tag.

For SH and LO, this should be straightforward, e.g. base64 or hex of the digest, truncated where needed to fit the VR's length limit (16 characters for SH, 64 for LO).
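As a rough sketch of the string mapping, assuming hex encoding and simple truncation to the VR's maximum length (`digest_to_string` and `MAX_LEN` are illustrative names):

```python
import hashlib

# Maximum value lengths per DICOM PS3.5 section 6.2.
MAX_LEN = {"SH": 16, "LO": 64}

def digest_to_string(digest: bytes, vr: str) -> str:
    """Hex-encode the digest and cut it to the length allowed for the VR."""
    return digest.hex().upper()[: MAX_LEN[vr]]

digest = hashlib.sha256(b"example-salt" + b"ACC0001").digest()
accession = digest_to_string(digest, "SH")   # 16 hex characters = 64 bits
patient_id = digest_to_string(digest, "LO")  # 64 hex characters = 256 bits
```

Note that an SH value keeps only 64 bits of the digest in this scheme, so collisions are more likely there than for LO.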

For UI, it's a bit more tricky: UIDs require either an assigned org root prefix or the 2.25. prefix followed by a 128-bit UUID written as a decimal integer. Either way, the UID cannot be longer than 64 characters including the dot separators. The digest of many modern hash functions is longer than that (e.g., 2**256 ≈ 1.2e77, i.e. 78 decimal digits). The easiest approach might be to truncate the digest to 128 bits and use the 2.25. prefix. The loss in entropy should be acceptable, considering that changing UIDs is required but adds only little to the overall robustness of the anonymization as long as PixelData remains unchanged (see note 3 in ref 3).
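The truncation approach could look roughly like this. `digest_to_uid` is a hypothetical helper; it treats the first 128 bits of the digest as an integer and does not set the UUID version/variant bits, so the result is UUID-sized rather than a formally generated UUID.

```python
import hashlib

def digest_to_uid(digest: bytes) -> str:
    """Map the first 128 bits of a digest to a UID under the 2.25. root."""
    value = int.from_bytes(digest[:16], "big")
    # At most 39 decimal digits plus "2.25." gives 44 characters, below the limit of 64.
    return f"2.25.{value}"

digest = hashlib.sha256(b"example-salt" + b"1.2.840.113619.2.55").digest()
uid = digest_to_uid(digest)
```

Formatting the integer with `str` also avoids leading zeros in the component, which UIDs do not allow.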

The module should fail if no mapping function has been implemented for the VR of the attribute that is being processed.
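One possible shape for this is a lookup table keyed by VR that raises for anything unlisted. The names `MAPPERS` and `map_digest` are illustrative, and the individual mappers are simplified assumptions.

```python
from typing import Callable, Dict

def _to_uid(digest: bytes) -> str:
    # Assumption: first 128 bits as a decimal integer under the 2.25. root.
    return f"2.25.{int.from_bytes(digest[:16], 'big')}"

def _to_hex(max_len: int) -> Callable[[bytes], str]:
    # Assumption: hex-encode and truncate to the VR's maximum length.
    return lambda digest: digest.hex().upper()[:max_len]

MAPPERS: Dict[str, Callable[[bytes], str]] = {
    "UI": _to_uid,
    "SH": _to_hex(16),
    "LO": _to_hex(64),
}

def map_digest(digest: bytes, vr: str) -> str:
    """Map a digest to a value valid for `vr`, or fail if no mapper exists."""
    try:
        mapper = MAPPERS[vr]
    except KeyError:
        raise NotImplementedError(f"no hash mapping implemented for VR {vr!r}") from None
    return mapper(digest)
```

Failing with an explicit error avoids silently writing a value that is invalid for the attribute's VR.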

Refs

1) https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview
2) https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_6.2.html
3) https://dicom.nema.org/medical/dicom/current/output/chtml/part15/sect_e.3.9.html