pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
141 stars 43 forks source link

Additional field expanders #87

Closed howardpchen closed 5 years ago

howardpchen commented 5 years ago

My group's use case includes passing a function to all DICOM Tags to check for certain PHI words/phrases. Noticed that while we can do startswith and endswith, there isn't an easy way to apply func:my_function on most or all of the DICOM tags. Would it be of interest if I expanded the deid/dicom/fields.py functionality (e.g. add expander words like "all," "contains," "allexcept" etc) and sent out a pull request? Thanks!

vsoch commented 5 years ago

I think we do have a use case for this, I released it recently - see this example https://github.com/pydicom/deid/tree/master/examples/dicom/header-manipulation

I think perhaps the addition you would need is a way to apply the function to ALL tags? Right now, the use case is for a specific tag like:

REPLACE StudyInstanceUID func:generate_uid

and then you would pass into the items dictionary like:

for item in items:
    items[item]['generate_uid'] = generate_uid

Maybe something like this would work to specify all tags?

REPLACE * func:clean_data

And you would still add it

for item in items:
    items[item]['generate_uid'] = clean_data

Does the * seem appropriate to indicate all tags? or should we have a keyword ALL (assuming that it's never going to be an actual tag :OP)

REPLACE ALL func:clean_data

As a developer * is more intuitive, but for users that aren't programmers ALL might be a better choice. What do you think?

vsoch commented 5 years ago

If you have something very specific / custom you could also just run it over the fields after you extract them, e.g.,

new_items = dict()
for field, value in items['items']:
    new_items[field] = func(value)   
items['items'] = new_items
howardpchen commented 5 years ago

Agree that * is less intuitive to non programmers, who seem to be the key target audience for the Recipes mechanics.

I was thinking to leverage existing field expander mechanism which can handle

REPLACE allFields: func:clean_data

with a few extra lines:

    if expander.lower() == "endswith":
        fields = [x for x in contenders if x.endswith(expression)]
    elif expander.lower() == "startswith":
       fields = [x for x in contenders if x.startswith(expression)]
    elif expander.lower() == "contains":
        fields = [x for x in contenders if expression in x]
    elif expander.lower() == "allfields":
        fields = contenders

The code also includes a 'contains' in addition to endswith and startswith. The use case is for something like Date and DateTime:

JITTER contains:Date 123

Covers both fields like StudyDate and AcquisitionDateTime. Something else that tends to be in the middle of a field is "SOP" like Referenced SOP Class UID in File:

REMOVE contains:SOP

howardpchen commented 5 years ago

Actually re: all fields I think it does make more sense to do it without the colon for the user, as you suggested:

REPLACE ALL func:clean_data

I feel pretty good that All won't become a standard DICOM tag soon :)

vsoch commented 5 years ago

With the last amendment, I agree on all fronts! It wouldn't be good to have something like

JITTER allFields: 123

to indicate all fields, because we would want to catch that as an error that the user possibly parsed the command incorrectly.

vsoch commented 5 years ago

I'm having dinner now, but I'll be happy to put this in for you tomorrow (unless you get to it first!) Thanks for the good suggestion.

vsoch commented 5 years ago

hey @howardpchen ! I've done a PR to work on this issue. Would you be able to review when you get a chance? https://github.com/pydicom/deid/pull/88 No rush, it's Sunday firstly, and I'm going to go out for a run and will be back later anyway :) Happy weekend!