pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Study Instance UID and Series Instance UID not changed when dicom contains Related Series Sequence tags #153

Closed skarrea closed 3 years ago

skarrea commented 3 years ago

It seems that Deid does not manage to find and change the the Study and Series Instance UID tags that are outside the Related Series Sequence when Related Series Sequence tags are present.

I've appended a (almost) minimally working example reproducing the error I see in my data. The python code first generates a dicom file and then tries to clean it by changing contains:StudyInstanceUID and contains:SeriesInstanceUID from their original values to modified.

A copy of the output is copied below. Ignore the weird warning formatting it is due to the windows console not having colors.

################################################################################
Original file
################################################################################
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ----
   (0020, 000d) Study Instance UID                  UI: RelStudy0
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID
File saved.
WARNING No specification, loading default base deid.dicom

################################################################################
Cleaned file
################################################################################
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ----
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: modified
   ---------
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: modified
   ---------
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: modified
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

Producing code:

import os
import datetime
import tempfile
from deid.config import DeidRecipe
from deid.dicom import get_files, get_identifiers, replace_identifiers

from pydicom.dataset import Dataset, FileDataset
from pydicom.uid import generate_uid
from pydicom.dataset import Dataset
from pydicom.sequence import Sequence

recipe = DeidRecipe('test.dicom')

# Create some temporary filenames
suffix = '.dcm'
filename_little_endian = tempfile.NamedTemporaryFile(suffix=suffix).name
filename_big_endian = tempfile.NamedTemporaryFile(suffix=suffix).name

print("Setting dataset values...")
# Create the FileDataset instance (initially no data elements, but file_meta
# supplied)
ds = FileDataset(filename_little_endian, {},
                 preamble=b"\0" * 128)

# Add the data elements -- not trying to set all required here. Check DICOM
# standard

# Set the transfer syntax
ds.is_little_endian = True
ds.is_implicit_VR = True

# Referemced image sequence

ds.PatientName = 'ReproducingTheError'

dt = datetime.datetime.now()
ds.InstanceCreationDate = dt.strftime('%Y%m%d')
ds.ContentTime = dt.strftime('%Y%m%d')

studyUID = generate_uid()
rel_im_ser_list = []
ds.RelatedSeriesSequence = Sequence([Dataset(), Dataset(), Dataset()])
for i in range(3):
    ds.RelatedSeriesSequence[i].StudyInstanceUID = 'RelStudy' + str(i)
    ds.RelatedSeriesSequence[i].SeriesInstanceUID = 'RelSeries' + str(i)

ds.SeriesInstanceUID = 'StudyUID'
ds.StudyInstanceUID = 'SeriesUID'

print(2*'\n')
print(80*'#')
print('Original file')
print(80*'#')
print(ds)

ds.save_as(filename_little_endian)
print("File saved.")

files = list(get_files(filename_little_endian))

items = get_identifiers(files)

for item in items:
    items[item]['new_val'] = 'modified'

cleaned_files = replace_identifiers(files,
                                    deid = recipe,
                                    ids = items)

print(2*'\n')
print(80*'#')
print('Cleaned file')
print(80*'#')

print(cleaned_files[0])

os.remove(filename_little_endian)

Deid recipe

FORMAT dicom

%header

REPLACE contains:StudyInstanceUID var:new_val
REPLACE contains:SeriesInstanceUID var:new_val
vsoch commented 3 years ago

Thank you for the report, and the clear way to reproduce! I'm testing locally and I've reproduced, and I'll share what I've found. It does look like the replacement function itself works okay - here we see that all the occurrences of the instance UIDs are changed to modified:

: parser.fields                                                                                                                            
Out[6]: 
{'(0008, 0012)': (0008, 0012) Instance Creation Date              DA: '20201001'  [InstanceCreationDate],
 '(0008, 0033)': (0008, 0033) Content Time                        TM: '20201001'  [ContentTime],
 '(0008, 1250)': (0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>  [RelatedSeriesSequence],
 '(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [StudyInstanceUID],
 '(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [SeriesInstanceUID],
 '(0010, 0010)': (0010, 0010) Patient's Name                      PN: 'ReproducingTheError'  [PatientName],
 '(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__0__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__1__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__2__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [RelatedSeriesSequence__SeriesInstanceUID]}

However the bug seems to be in actually using those fields to do the save, because the resulting file appears as you've showed. I suspect it's indexing. In the DeidParser perform_action we do grab the fields correctly:

fields
{'(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy0  [StudyInstanceUID],
 '(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy0  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy1  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy2  [RelatedSeriesSequence__StudyInstanceUID]}

Ah so I found the bug (now we are in the class add_value method. The issue is that the dicom generated doesn't have a value attribute, likely because you created it on the fly. The reason it has the field but then only replaces the nested ones is because we use not having the value directly as an indicator that there is a nested field, e.g.,

# Nested fields
while not hasattr(element, "value"):
    element = element.element

I don't know why, but the iteration through the self.fields seems to change both instances (the nested and non-nested) but doesn't actual edit the dicom. Here is an example - see that modified appears twice in self.fields, but only once in the self.dicom

 self.fields
{'(0008, 0012)': (0008, 0012) Instance Creation Date              DA: '20201001'  [InstanceCreationDate],
 '(0008, 0033)': (0008, 0033) Content Time                        TM: '20201001'  [ContentTime],
 '(0008, 1250)': (0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>  [RelatedSeriesSequence],
 '(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [StudyInstanceUID],
 '(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries0  [SeriesInstanceUID],
 '(0010, 0010)': (0010, 0010) Patient's Name                      PN: 'ReproducingTheError'  [PatientName],
 '(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__0__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries0  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy1  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__1__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries1  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy2  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__2__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries2  [RelatedSeriesSequence__SeriesInstanceUID]}

> self.dicom
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ---- 
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

The way this works is that (technically) the self.fields should always have a DicomField class that references the same element in the dicom data, that way when we edit one, the other edits as well. But this doesn't appear to be happening for some reason.

I tried removing the deepcopy of the fields (added recently) and this isn't the issue - we still don't change the first one. I then moved earlier in the code and just tried manually changing a value - this hints that perhaps the reference to the top level field is incorrect, in fact it's pointing to the one that comes much later!

field
Out[5]: (0020, 000d) Study Instance UID                  UI: RelStudy0  [StudyInstanceUID]

In [6]: field.element.value
Out[6]: 'RelStudy0'

In [7]: self.dicom
Out[7]: 
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ---- 
   (0020, 000d) Study Instance UID                  UI: RelStudy0
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

In [8]: field.element.value
Out[8]: 'RelStudy0'

In [9]: field.element.value = 'ADIFFERENTONE'

In [10]: self.dicom
Out[10]: 
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ---- 
   (0020, 000d) Study Instance UID                  UI: ADIFFERENTONE
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

In the above, since we indexed uid (0008, 0012) we would expect the top level one to change, but that isn't the case. So then I stepped back and looked at fields, get_fields method, and it appears that the issue arises because we use dataset.iterall(), which essentially flattens the dicom fields (and gets them mixed up). Here is an example of using iterall vs. just items():

In [19]: for key, value in dataset.items():
    ...:     print(key)
    ...:     print(value)
    ...:
(0008, 0012)
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033)
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)
(0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>
(0010, 0010)
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d)
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e)
(0020, 000e) Series Instance UID                 UI: StudyUID

In [20]: for item in dataset.iterall():
    ...:     print(item)
    ...:
    ...:
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>
(0020, 000d) Study Instance UID                  UI: RelStudy0
(0020, 000e) Series Instance UID                 UI: RelSeries0
(0020, 000d) Study Instance UID                  UI: RelStudy1
(0020, 000e) Series Instance UID                 UI: RelSeries1
(0020, 000d) Study Instance UID                  UI: RelStudy2
(0020, 000e) Series Instance UID                 UI: RelSeries2
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

So I've been interactively debugging for a few hours this morning - I know the issue is here with respect to how we iterate through the elements and save the association, but I haven't yet figured out what's wrong / how to fix it. @wetzelj do you have any ideas?

vsoch commented 3 years ago

oh oops, I spoke too soon... I think I fixed it! Let me put in a PR.

wetzelj commented 3 years ago

I've only cursorily read this (and have to hop on to a meeting now), but wasn't the iterall introduced to enable us to act on sequences and sub-sequences as well? Does removing the iterall break anything with detecting data in those subsequences?

vsoch commented 3 years ago

Here is the PR! https://github.com/pydicom/deid/pull/154

@wetzelj actually I don't think so! The reason is that we already loop through a sequence, discover that it is a sequence, and then add it's values as datasets to parse through. So we sort of unwrap the list. I think this issue likely arose because this function first just iterated through fields (and did it's own manual extraction of nested) but then you showed me the magic of iterall and I changed it, and I suspect we didn't fully test the replace again. That said, possibly now we lost something that iterall was providing? Please take a look at the PR and let me know what you think!

wetzelj commented 3 years ago

Sounds good. I ran some tests and it seems to be okay - I didn't notice anything being lost, just the new corrected functionality. :) I'll keep an eye out as I continue development/testing.

vsoch commented 3 years ago

Fixed with #154