pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

WIP+ENH: Add REMAP action for automatically remapping (instance) UIDs #203

Closed moloney closed 2 years ago

moloney commented 2 years ago

Description

This PR improves (instance) UID handling by allowing them to be automatically remapped through a new "REMAP" action. The new UIDs are created in a way that strips all PHI (usually timestamps embedded into the UID) while preserving the study/series hierarchy (defined by the UIDs) in the anonymized output. This allows the anonymized data to be(for example) uploaded to a PACS or other network entity.

Checklist

Open questions

I also updated the default recipe to handle UIDs in this way, if you prefer I don't do that (at least for now) let me know.

vsoch commented 2 years ago

Hey thanks for the PR!

So this seems very hard coded for your specific use case (which is okay) but I want to challenge you to think of this more generally. E.g., what you really are doing is running a custom function given a specific VR value. E.g.,:

field.element.VR == 'UI

So why is this not just a replace with a custom function? The function is handed the field so you'd just do that check within the function and act appropriately. Is there a reason this approach will not work?

vsoch commented 2 years ago

E.g.:

REPLACE contains:((?<!SOPClass)UID) func:remap_uid
def remap_uid(item, value, field):
    """Remap existing UID in stable and secure manner

    Same input UID creates the same output UID, which keeps study / series
    associations correct (or even FrameOfReferenceUID) provided the full
    study / series in anonymized. At same time input UID can't be recovered
    from output so any potential PHI (i.e. dates / times) is eliminated.
    """
    if field.element.VR == 'UI':
        new_val = remap_uid(field.element)

        if isinstance(field.value, list):
            # Handle VM > 1 
            return [generate_uid(entropy_srcs=[x]) for x in field.value]
        else:
            return generate_uid(entropy_srcs=[field.value])

    # default to return the new value
    return new_val
moloney commented 2 years ago

I agree this could just be a separate function rather than a dedicated "action". The line is a little blurry, JITTER could just be a function too, right? Does the 'deid' package provide a library of useful functions like this that could then be reference in the default recipe? I think this remapping behavior for UIDs is a much better default than stripping them as the latter produces essentially "broken" DICOM files.

vsoch commented 2 years ago

The difference is that JITTER is an explicit but general action, and one that is generally requested across this process so the dates aren't the same. This REMAP isn't really intuitive, arguably it could describe something else, and it's on the same par as a custom uid function. So I think in my mind, the two are fairly different.

Does the 'deid' package provide a library of useful functions like this that could then be reference in the default recipe? I think this remapping behavior for UIDs is a much better default than stripping them as the latter produces essentially "broken" DICOM files.

But I love this idea! Indeed it would be really useful to provide custom functions, and make it easy to contribute and then use! If you want to talk about a design to support that (and then make it easier for the user to use as opposed to needing to roll their own as you have) I'd definitely be open to this new addition. Perhaps the current deid/dicom/actions.py could be renamed to a module folder, e.g.,:

deid/dicom/actions/
    __init__.py
   jitter.py             <- the current jitter timestamp function

and JITTER would be grandfathered into being it's own term (for the reasons I specified) but technically it's still an action and could go alongside the actions (I'll leave the organization up to you - I optimize to make it easier for the developer to find things). But then perhaps we could have a uids set of actions:

deid/dicom/actions/
    __init__.py
   jitter.py             <- the current jitter timestamp function
   uids.py

Into which we can put the documentation example, along with your function here! And then for usage:

REPLACE contains:((?<!SOPClass)UID) deid_func:remap_uid

and all the custom functions would be imported into init.py and thus accessible via deid.dicom.actions.<name> so you'd parse that it's a deid_func, get the string, and then use importlib to get it. Then we would have a nice table of custom functions defined, and instructions for adding a custom function (basically writing a new module in that file).

Let me know your thoughts! That's just a quick idea for a design.

moloney commented 2 years ago

I agree, that is a better approach and will make it much easier to add more custom functions in the future. I will close this and open a different PR.

vsoch commented 2 years ago

Awesome! Looking forward to seeing it - please ping me if you want to have any discussion, etc.

wetzelj commented 2 years ago

This is a wonderful idea!

vsoch commented 2 years ago

@moloney and @wetzelj please see https://github.com/pydicom/deid/pull/208 !