pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License

Finding relocated private elements using "PrivateCreator" #205

Open moloney opened 2 years ago

moloney commented 2 years ago

Private elements can be relocated (group number will change) and some popular PACS will absolutely do this to data passing through them. Instead of looking up an absolute (group_number, element_number) in rules, we should allow looking up private elements with a PrivateCreator plus an element_number. Then when processing a dataset you look for a matching PrivateCreator string (e.g. "SIEMENS CSA HEADER") to find the private group number dynamically.
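One way to picture this, as a minimal sketch rather than anything deid currently offers: a rule is keyed by a private creator string plus an element offset, and the concrete tag is resolved per dataset. The dataset is modeled as a plain dict of (group, element) to value, and `resolve_private_rule` is a hypothetical name used only for illustration.

```python
def resolve_private_rule(dataset, creator, offset):
    """Return every concrete (group, element) tag matching creator + offset.

    dataset: dict mapping (group, element) -> value. This is a stand-in for
             a real DICOM dataset (hypothetical layout, illustration only).
    creator: Private Creator string, e.g. "SIEMENS CSA HEADER".
    offset:  low byte of the element number inside the reserved block.
    """
    matches = []
    for (group, element), value in dataset.items():
        # Private Creator elements live in odd groups at (gggg,0010)-(gggg,00FF).
        is_creator_slot = group % 2 == 1 and 0x10 <= element <= 0xFF
        if is_creator_slot and value == creator:
            # The slot number becomes the high byte of each element in the block.
            tag = (group, (element << 8) | offset)
            if tag in dataset:
                matches.append(tag)
    return sorted(matches)


# The same element before and after a PACS relocates the block.
original = {
    (0x0029, 0x0010): "SIEMENS CSA HEADER",
    (0x0029, 0x1010): b"csa image header",
}
relocated = {
    (0x0029, 0x0027): "SIEMENS CSA HEADER",
    (0x0029, 0x2710): b"csa image header",
}
```

With this, the single rule ("SIEMENS CSA HEADER", 0x10) resolves to (0029,1010) in the first dataset and (0029,2710) in the second.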

I'm not sure how this capability should be exposed in the recipes, any thoughts?

vsoch commented 2 years ago

Can you give me a little example in pseudocode? E.g., is the idea that the PrivateCreator is a string and it's used in place of the group number?

wetzelj commented 2 years ago

@moloney - Would a %values section work for you?

With a values section, you can define a field (or list of fields) that you want to use to define a set of values. This values list would then be used in conjunction with a REMOVE or other command to change the identified tags.

%values manufacturer_values
FIELD Manufacturer

%header
REMOVE values:manufacturer_values

Header Before
PatientName - John Doe
Manufacturer - Siemens
(0009,0009) - SIEMENS CSA HEADER

Header After
PatientName - John Doe

If that doesn't work, a custom function may be another option for you. The custom function could be written to grab the value of the tag being acted on, look for "SIEMENS CSA HEADER", and then return the appropriate value for the action that you want to take.

jstorrs commented 2 years ago

I don't know if this background information is helpful or not, but private elements are tricky in DICOM because the standard is designed to allow multiple vendors to add private tags without stepping on each other's toes (if they do it correctly), and pydicom is great because it understands this. Often people get data directly from modalities and never encounter relocated private blocks.

Generally we think of private tags as being like DICOM standard tags, with fully specified (group, element) pairs. But for private tags there's a layer of indirection via the Private Creator.

For example here is a Siemens CSA group (used a lot in MRI research):

(0029,0010) LO [SIEMENS CSA HEADER]                     #  18, 1 PrivateCreator
(0029,0011) LO [SIEMENS MEDCOM HEADER2]                 #  22, 1 PrivateCreator
(0029,1008) CS [IMAGE NUM 4]                            #  12, 1 Unknown Tag & Data
(0029,1009) LO [20211010]                               #   8, 1 Unknown Tag & Data
(0029,1010) OB 53\56\31\30\04\03\02\01\65\00\00\00\4d\00\00\00\45\63\68\6f\4c\69... # 11008, 1 Unknown Tag & Data
(0029,1018) CS [MR]                                     #   2, 1 Unknown Tag & Data
(0029,1019) LO [20211010]                               #   8, 1 Unknown Tag & Data
(0029,1020) OB 53\56\31\30\04\03\02\01\4f\00\00\00\4d\00\00\00\55\73\65\64\50\61... # 93880, 1 Unknown Tag & Data
(0029,1160) LO [com]                                    #   4, 1 Unknown Tag & Data

The things to pay attention to with private tags are these patterns:

(GGGG,00XX) LO [NAME]  # PrivateCreator
(GGGG,XX01) item
(GGGG,XX02) item
...

The first two lines in the above example are "reservations" that declare:

(0029,10xx) is reserved for "SIEMENS CSA HEADER"
(0029,11xx) is reserved for "SIEMENS MEDCOM HEADER2"

Those specific numbers 0x10 and 0x11 can be remapped by DICOM processors. So for example it is entirely valid to move the entire "SIEMENS CSA HEADER" private block from 0x0029,0x10... to, say, 0x0029,0x27... as long as the group is preserved.

(0029,0011) LO [SIEMENS MEDCOM HEADER2]                 #  22, 1 PrivateCreator
(0029,0027) LO [SIEMENS CSA HEADER]                     #  18, 1 PrivateCreator
(0029,1160) LO [com]                                    #   4, 1 Unknown Tag & Data
(0029,2708) CS [IMAGE NUM 4]                            #  12, 1 Unknown Tag & Data
(0029,2709) LO [20211010]                               #   8, 1 Unknown Tag & Data
(0029,2710) OB 53\56\31\30\04\03\02\01\65\00\00\00\4d\00\00\00\45\63\68\6f\4c\69... # 11008, 1 Unknown Tag & Data
(0029,2718) CS [MR]                                     #   2, 1 Unknown Tag & Data
(0029,2719) LO [20211010]                               #   8, 1 Unknown Tag & Data
(0029,2720) OB 53\56\31\30\04\03\02\01\4f\00\00\00\4d\00\00\00\55\73\65\64\50\61... # 93880, 1 Unknown Tag & Data

The general pattern for this is:

(GGGG,00XX) LO [NAME OF PRIVATE BLOCK]
(GGGG,XXYY) .... items within the block 
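The arithmetic behind that pattern can be sketched in a few lines of plain Python. The helper names `private_tag` and `fmt` are made up for this illustration and are not part of pydicom or deid.

```python
def private_tag(group, slot, offset):
    """Build the (group, element) pair for an item inside a private block.

    slot is the XX in (GGGG,00XX); offset is the YY in (GGGG,XXYY).
    """
    if group % 2 == 0:
        raise ValueError("private groups are odd-numbered")
    if not 0x10 <= slot <= 0xFF:
        raise ValueError("private creator slots range from 0x10 to 0xFF")
    if not 0x00 <= offset <= 0xFF:
        raise ValueError("offset must fit in one byte")
    # The slot fills the high byte of the element number, the offset the low byte.
    return (group, (slot << 8) | offset)


def fmt(tag):
    """Format a (group, element) pair the way the dumps above print it."""
    return "({:04x},{:04x})".format(*tag)


print(fmt(private_tag(0x0029, 0x10, 0x08)))  # (0029,1008)
print(fmt(private_tag(0x0029, 0x27, 0x08)))  # (0029,2708)
```

The two printed tags are the same "IMAGE NUM 4" item from the dumps above, before and after the block moves from slot 0x10 to slot 0x27.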

Since the location of the block within the group is "arbitrary", one of the nice things about pydicom's dataset is that you can select private items like this:

ds.get_private_item(0x0029, 0x10, "SIEMENS CSA HEADER")

vsoch commented 2 years ago

Ah that's handy! @moloney can you give me an example file and some set of actions you want to do and I can try this out?

jstorrs commented 2 years ago

Not the OP, but one thing I'm working on is a custom function to clean the Siemens CSA headers, at least to the point that the contents can be vouched for by the deidentifier and that dcm2niix/gdcm etc. work as expected on the output. CSA headers don't seem to be terribly complex, but they are big and scary. Generally the approach is to just throw them out because they cannot be vouched for (which is a big challenge because they're used to build NIfTI, and CROs have demanded they be maintained, which is a different story). The reality is that CSA data blobs contain things like null-terminated strings stored within larger fixed-length fields that were not initialized prior to use, so even if the CSA dumps look good, the "dead space" contains who-knows-what. But the locations of dead space in the CSA are knowable and can just be overwritten with nulls etc.
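To illustrate just the "overwrite the dead space" step, here is a minimal sketch for a single fixed-length field. It assumes the field boundaries are already known from parsing the CSA structure, which is not shown; `scrub_dead_space` and the sample bytes are hypothetical and say nothing about the real CSA layout.

```python
def scrub_dead_space(field: bytes) -> bytes:
    """Zero every byte after the first NUL in a fixed-length string field.

    The returned value has the same length as the input, so the surrounding
    binary structure keeps its offsets.
    """
    end = field.find(b"\x00")
    if end == -1:
        # No terminator: the string fills the whole field, nothing to scrub.
        return field
    return field[:end] + b"\x00" * (len(field) - end)


# A 14-byte field holding "MR" plus leftover bytes from an earlier use of the buffer.
raw = b"MR\x00DOE^JOHN\x00\x00\x00"
clean = scrub_dead_space(raw)
```

Here `clean` keeps the visible string "MR" and the original length, while the stale bytes after the terminator are replaced with nulls.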

vsoch commented 2 years ago

@jstorrs if you get something working and would like to contribute here, it would be hugely welcome!

jstorrs commented 1 year ago

I've been thinking about a couple approaches to this and familiarizing myself with the deid codebase. The solution I'd like to try is to add a new dictionary: section to DICOM deid recipes. It would contain typical DICOM dictionary definition lines that define keywords and can be used to track the private creator. Basically the section would feed pydicom.datadict.add_dict_entry() and pydicom.datadict.add_private_dict_entry(). I'll start working on this over the next few days.

vsoch commented 1 year ago

That's a cool idea! Can you spec out an example for discussion?

jstorrs commented 1 year ago

Odd question, and I don't know if anyone here has an answer... is it valid for the same private creator string to be reused for multiple reservations in the same group? i.e. suppose you have:

(0021,0010) LO [MY PRIVATE CREATOR]
(0021,0011) LO [MY PRIVATE CREATOR]
(0021,0012) LO [MY PRIVATE CREATOR]
(0021,0013) LO [MY PRIVATE CREATOR]
(0021,1001) LO [VALUE 1]
(0021,1101) LO [VALUE 2]
(0021,1201) DS [3]
(0021,1301) LO [VALUE 4]

I haven't been able to find anything in the standard that forbids this. But if I encountered a file like this, I'd flag it, since something is obviously amiss. This is going to be a challenge for this sort of lookup. I'll see whether pydicom or dcmtk has thought about this weird case.

I have not encountered this in the wild; I'm just pondering how to handle it. Previously my thought was that (0x0021,"MY PRIVATE CREATOR",0x01) must obviously be unique and duplicate private creators are forbidden, but I can't find where/if that's specified in the standard.

https://dicom.nema.org/dicom/2013/output/chtml/part05/sect_7.8.html
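Flagging such a file could look roughly like the sketch below. It models the dataset as a plain dict of (group, element) to value rather than using pydicom, and `duplicate_private_creators` is a hypothetical helper name.

```python
from collections import defaultdict


def duplicate_private_creators(dataset):
    """Map (group, creator) -> sorted slots, for creators reserved more than once.

    dataset: dict mapping (group, element) -> value, a stand-in for a parsed file.
    """
    slots = defaultdict(list)
    for (group, element), value in dataset.items():
        # Only (gggg,0010)-(gggg,00FF) in odd groups are Private Creator elements.
        if group % 2 == 1 and 0x10 <= element <= 0xFF:
            slots[(group, value)].append(element)
    return {key: sorted(found) for key, found in slots.items() if len(found) > 1}


# The example above, expressed as a dict.
suspicious = {
    (0x0021, 0x0010): "MY PRIVATE CREATOR",
    (0x0021, 0x0011): "MY PRIVATE CREATOR",
    (0x0021, 0x0012): "MY PRIVATE CREATOR",
    (0x0021, 0x0013): "MY PRIVATE CREATOR",
    (0x0021, 0x1001): "VALUE 1",
    (0x0021, 0x1101): "VALUE 2",
    (0x0021, 0x1201): "3",
    (0x0021, 0x1301): "VALUE 4",
}
flagged = duplicate_private_creators(suspicious)
```

A non-empty result means (group, creator, offset) no longer identifies a single element, so a creator-based rule would be ambiguous for that file.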

jstorrs commented 1 year ago

Edit: never mind, I just realized after posting the URL that I was looking at the 2013 version (it somehow comes to the top of Google for me). The latest version forbids it and says that if there's a need for some reason, the implementation should use sequences.

https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_7.8.html