pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

changed example to convert uuid to string representation of bigint an… #131

Closed sjswerdloff closed 4 years ago

sjswerdloff commented 4 years ago

Return DICOM conformant UID in example

Description

changed example to convert uuid to string representation of bigint and append to an org root so the returned uid itself was DICOM conformant for an element with Value Representation of UI. Borrowed PYMEDPHYS_ROOT_UID for the example. Probably should have used PYDICOM org root: "1.2.826.0.1.3680043.8.498" or better yet... pydicom.uid.generate_uid() (which has an optional prefix/org_root argument) Related issues: # 130

Please include a summary of the change(s) and if relevant, any related issues above.

Checklist

or at least black didn't complain in any way it wasn't already complaining

Open questions

Questions that require more discussion or to be addressed in future development: Q1) Depending on the size/length of the org root, the generated value might need to be sliced to fit. This increases the likelihood of a collision. A mitigation of that is the use of the ORG_ROOT. Q2) As a child of pydicom, maybe it's better to use pydicom.uid.generate_uid() for the example

vsoch commented 4 years ago

hey @sjswerdloff thanks for the pointer on the contributing.md - I edited the first bullet so it directs all PRs to the master branch!

This is a great start for discussion! I think there are two ways to go about this - the first is to say "We don't want to go into gray area where we might potentially tell someone the wrong thing about generating uids. For the example here, we would update the function to be something named "generate_identity" and then perhaps use the same simple uuid.uuit4() to generate a PatientName or PatientID or something like that. The second option (that you've started here) is to keep the fields, and generate a unique id. But isn't this (for the most part) the job of the machine? So is is really a reasonable / real life example? My original intention was to show an example to generate a dummy identifier to replace the real identifiers during some de-identification process. For example, if we had dicoms for some research study. I suspect the researchers wouldn't care much about these fields and would just blank them.

Whatever we decide, we need to be consistent, because this particular generate_uid is found across several files (examples, code, and docs)

I do think if you've put in the work to figure out how to generate a correct uid, it should definitely be added to https://github.com/pydicom/contrib-pydicom. So in summary - we would want to choose something consistent to do that is in line with a reasonably actual use case, and then make these changes across all the mentions of the function. Let me know your thoughts!

sjswerdloff commented 4 years ago

The second option (that you've started here) is to keep the fields, and generate a unique id. But isn't this (for the most part) the job of the machine?

For the creation of the original data, yes, creation of the UIDs is the responsibility of a variety of systems involved in the patient's journey. However, given we are trying to de-identify, real world UIDs tend to have time stamps in them. They shouldn't, but they do, and back before people were quite as worried about simply exposing the date as means of re-identifying data I did the same thing. As a vendor or developer, it's not unusual to store or export an SOP Instance as a file, often Part 10, with the SOP Instance UID being the major part of the file name, and if the SOP Instance UID you generated for the file has a time stamp in it, it makes it a lot easier to figure out quickly "which data you need to go look at". Being guilty of this practice, at least in the past, I don't find fault with the very large number of implementations that did this (and still do). So it is in fact important to replace the UIDs. And there are frequently references to a UID held in a separate object, so it is important to maintain consistency in replacing a UID

So is is really a reasonable / real life example?

Definitely. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170049/ I didn't do the programming in the above paper. I provided the DICOM/domain expertise.

put in the work to figure out how to generate a correct uid

in retrospect, I probably just reinvented the wheel without realizing it, and pydicom.uid.generate_uid(), if you look at the details, is doing about the same thing. And it's more heavily tested, more flexible, etc.

It's a real use case. My intent (in terms of what I'm working on in FOSS over the next wee while) is to implement a set of functions in python that accomplish basically what was written about in the paper I reference above, less the scrubbing of the pixel data (which you already have covered). I have been working on that in PyMedPhys, parallel to the anonymise module (no PR yet), for utilisation in OnkoDICOM. But the same functions could be wrapped as you have shown, and instead (or in addition to) be applied in deid, which has a more comprehensive approach. It's a somewhat different use case, and for that I'm taking the UID, hashing it (SHA3-256), and then converting the hash in to a "BigInt" (which allows for simple forward calculating consistency if you find the same UID being referenced in some other DICOM object).

If you want to have a different kind of identifier, and you want to use uuid, that's fine. It really is. The use of uuid as the crux of a pseudo-random value generator is very respectable. And if you want to use that UUID for a text value for an element whose VR has an allowable character set that is a superset of what is valid for a UUID ([a-f0-9-] ? hexadecimal and the dash symbol?), and then slice it as needed to fit the size (there are some VRs that have length 16, while UUID is apparently 36 characters), that's OK by me.

But if one is showing DICOM as an example, and one uses "uid" as the acronym instead of uuid, it gives the distinct impression to the user or programmer that one is going to provide a DICOM UID that meets the requirements of the UI Value Representation. It will succeed in generating output, but the resulting DICOM file would fail validation (immediately), and would never be able to be transmitted to an SCP.

if we had dicoms for some research study. I suspect the researchers wouldn't care much about these fields

They would care if the data could no longer be consumed by typical DICOM tools, transmitted between typical DICOM systems, nor stored in a typical DICOM storage application/appliance (PACS like thing). There are many use cases where people are working on a local file system, and just yanking out the data that they need, so for those scenarios, they would manage, but for many use cases, things would break quickly.

I can go back and commit a fix that imports pydicom and makes a direct call to pydicom.uid.generate_uid(), and update my PR if that would be better.

vsoch commented 4 years ago

That would work for me! Let's make sure we hit most of the spots I pointed out for consistency, and perhaps also just mention that the user is free to write their own function (for example, the DicomParser might be a good place to put that).

vsoch commented 4 years ago

And no worries about the formatting, that definitely happens to me too :)

sjswerdloff commented 4 years ago

I ran black against my update also, and it modified code I hadn't otherwise touched. I can revert that if its a problem

A bigger issue is that now I see that one of the documentation files is showing the use of replacing a date with a uuid as well. Changing the function to generate a DICOM UID and then using that to populate InstanceCreationDate (or more generally, a DICOM element with VR of DA) is not really any better.

Any individual replacement of a value needs to provide something that is valid for the Value Representation.

A specific approach for dealing with dates can be found in http://www.dclunie.com/pixelmed/software/webstart/DicomCleanerUsage.html The TL;DR of that is the user specifying a "begin epoch" date, and time shift everything with respect to the earliest activity that shows up in the set of data being operated on (typically the Study Date).

Sidebar: I don't understand why there is a replace that uses a single blank space in both parameters. I had not noticed that before. I often see replacements of blank space with underscore.

Tentative Conclusion: perhaps it would be best, after all, to leave things as they are and just indicate in the documentation that the example will in fact result in values that do not meet the content requirements for most VR, and practical implementations will need to address that, i.e.: just mention that "the user is free to write their own function that meets the content requirements for the particular VR of the element whose content is being replaced"

Which is basically what I'm working on, so I'm likely to be an example of that user. I'm not going to be offended if the PR gets rejected, it's probably simpler that way for now. Once I have working code to accomplish what brought me here in the first place, that might serve as a fully functional example. The current example accomplishes the intent of showing someone how to go about writing and utilizing their own function. I will need to dig deeper to see if it is better to have a single function get called, with the function internally identifying the VR involved and applying an appropriate algorithm to create a replacement appropriate to the VR, or writing separate functions and having those called separately from higher up.

vsoch commented 4 years ago

I ran black against my update also, and it modified code I hadn't otherwise touched. I can revert that if its a problem

This does happen with different versions of black - don't worry about it for now!

A bigger issue is that now I see that one of the documentation files is showing the use of replacing a date with a uuid as well. Changing the function to generate a DICOM UID and then using that to populate InstanceCreationDate (or more generally, a DICOM element with VR of DA) is not really any better.

We should definitely fix this up - do you want to take a shot, or show me where it is so that I can?

Sidebar: I don't understand why there is a replace that uses a single blank space in both parameters. I had not noticed that before. I often see replacements of blank space with underscore.

Could you elaborate on this a bit (sidenote?) Or give an example? I'm not sure that I follow.

perhaps it would be best, after all, to leave things as they are and just indicate in the documentation that the example will in fact result in values that do not meet the content requirements for most VR, and practical implementations will need to address that, i.e.: just mention that "the user is free to write their own function that meets the content requirements for the particular VR of the element whose content is being replaced"

This would be spot on with how I would address these issues - the examples weren't meant to be real world-esche, but really simple / dummy to just show the functionality. I do agree that we don't want to show bad / wrong practice, however. I think adding a note in bold to the sections of the example and documentation would be a good solution for this, with perhaps a link to the resources you've already shared that are examples of how to do it correctly (e.g., pydicom's uid generator).

I would also still be okay if you wanted to update the other pages in the docs (I listed them above for you) where uid is mentioned. Let me know what you would like to discuss and how I can help.

sjswerdloff commented 4 years ago

I was working through running the modified code and now I see that the input sample file ("MR.dcm") is not valid DICOM to start with, it already has invalid content for UIDs and dates. Not only are there uuids already in place for some of these, but there are date values that may not have been replaced (if you look at them, they appear as human readable dates, just not compliant, e.g. 2009-11-24 rather than 20091124, and the rows x columns x bits allocated/8 didn't match the number of bytes of pixel data (so I took a guess by factoring the # of byte/2), but I made no effort to view the pixel data. I've run it through DVTK, found and "fixed" (modified to a value that met the conformance requirements and seemed somewhat reasonable given the other data in the file).

hopefully it is attached as MR_cleanup_two.zip

First the input data needs to be valid/conformant, otherwise testing for the validity/conformance of the output data is a futile exercise.

I don't know if "correcting" the MR.dcm file will cause tests to break, which is why I didn't try to put it in the PR itself.

vsoch commented 4 years ago

@sjswerdloff I can help to integrate the MR.dcm into the PR - but I don't see it, where is it attached?

sjswerdloff commented 4 years ago

I had tried to attach the file to comment, but that didn't work. I have now put the updated MR.dcm file in the PR.

vsoch commented 4 years ago

Great! I'll be able to take a look today. For future work, the hack to add a non-traditional (usually non text or pdf file) to be attached to a comment is to add a .txt extension, e.g., MR.dcm.txt. Then the only limit would be the file size. Otherwise, you can add to the PR as you did.

vsoch commented 4 years ago

I'm working on this now - there is a bug that the sequence isn't updated (as it used to be) so hopefully I'll figure out a fix.

vsoch commented 4 years ago

@sjswerdloff I've updated your work to:

  1. be consistent in the update to replace_uid - you had only done it for an example file, I did it across the documentation
  2. the example.py shouldn't use a replace_uid, but instead a replace_date
  3. I also fixed a bug with the replacement of sequences

If this looks good to you and you are happy, I'd like to suggest changing this from a draft so we can get it merged and released.

vsoch commented 4 years ago

Also, when I tried the pydicom function to generate the uid, it gave me an error about the prefix. So I used your home grown one instead.

vsoch commented 4 years ago

Also just ran black on the examples folder to clean it up!

sjswerdloff commented 4 years ago

Great! I didn't intend to leave the additional files for others (you) to do, but I'm most grateful that you addressed them. I do intend to come back with a fully implemented example that utilises "pseudonymisation" (hashing of text but encoding within the VR constraints, offset/jitter of dates, ) as it's means of de-identifying.