Closed skarrea closed 3 years ago
Thank you for the report, and the clear way to reproduce! I'm testing locally and I've reproduced, and I'll share what I've found. It does look like the replacement function itself works okay - here we see that all the occurrences of the instance UIDs are changed to modified:
: parser.fields
Out[6]:
{'(0008, 0012)': (0008, 0012) Instance Creation Date DA: '20201001' [InstanceCreationDate],
'(0008, 0033)': (0008, 0033) Content Time TM: '20201001' [ContentTime],
'(0008, 1250)': (0008, 1250) Related Series Sequence SQ: <Sequence, length 3> [RelatedSeriesSequence],
'(0020, 000d)': (0020, 000d) Study Instance UID UI: modified [StudyInstanceUID],
'(0020, 000e)': (0020, 000e) Series Instance UID UI: modified [SeriesInstanceUID],
'(0010, 0010)': (0010, 0010) Patient's Name PN: 'ReproducingTheError' [PatientName],
'(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID UI: modified [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__0__(0020, 000e)': (0020, 000e) Series Instance UID UI: modified [RelatedSeriesSequence__SeriesInstanceUID],
'(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID UI: modified [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__1__(0020, 000e)': (0020, 000e) Series Instance UID UI: modified [RelatedSeriesSequence__SeriesInstanceUID],
'(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID UI: modified [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__2__(0020, 000e)': (0020, 000e) Series Instance UID UI: modified [RelatedSeriesSequence__SeriesInstanceUID]}
However the bug seems to be in actually using those fields to do the save, because the resulting file appears as you've showed. I suspect it's indexing. In the DeidParser perform_action we do grab the fields correctly:
fields
{'(0020, 000d)': (0020, 000d) Study Instance UID UI: RelStudy0 [StudyInstanceUID],
'(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID UI: RelStudy0 [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID UI: RelStudy1 [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID UI: RelStudy2 [RelatedSeriesSequence__StudyInstanceUID]}
Ah so I found the bug (now we are in the class add_value method. The issue is that the dicom generated doesn't have a value attribute, likely because you created it on the fly. The reason it has the field but then only replaces the nested ones is because we use not having the value directly as an indicator that there is a nested field, e.g.,
# Nested fields
while not hasattr(element, "value"):
element = element.element
I don't know why, but the iteration through the self.fields seems to change both instances (the nested and non-nested) but doesn't actual edit the dicom. Here is an example - see that modified appears twice in self.fields, but only once in the self.dicom
self.fields
{'(0008, 0012)': (0008, 0012) Instance Creation Date DA: '20201001' [InstanceCreationDate],
'(0008, 0033)': (0008, 0033) Content Time TM: '20201001' [ContentTime],
'(0008, 1250)': (0008, 1250) Related Series Sequence SQ: <Sequence, length 3> [RelatedSeriesSequence],
'(0020, 000d)': (0020, 000d) Study Instance UID UI: modified [StudyInstanceUID],
'(0020, 000e)': (0020, 000e) Series Instance UID UI: RelSeries0 [SeriesInstanceUID],
'(0010, 0010)': (0010, 0010) Patient's Name PN: 'ReproducingTheError' [PatientName],
'(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID UI: modified [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__0__(0020, 000e)': (0020, 000e) Series Instance UID UI: RelSeries0 [RelatedSeriesSequence__SeriesInstanceUID],
'(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID UI: RelStudy1 [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__1__(0020, 000e)': (0020, 000e) Series Instance UID UI: RelSeries1 [RelatedSeriesSequence__SeriesInstanceUID],
'(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID UI: RelStudy2 [RelatedSeriesSequence__StudyInstanceUID],
'(0008, 1250)__2__(0020, 000e)': (0020, 000e) Series Instance UID UI: RelSeries2 [RelatedSeriesSequence__SeriesInstanceUID]}
> self.dicom
(0008, 0012) Instance Creation Date DA: '20201001'
(0008, 0033) Content Time TM: '20201001'
(0008, 1250) Related Series Sequence 3 item(s) ----
(0020, 000d) Study Instance UID UI: modified
(0020, 000e) Series Instance UID UI: RelSeries0
---------
(0020, 000d) Study Instance UID UI: RelStudy1
(0020, 000e) Series Instance UID UI: RelSeries1
---------
(0020, 000d) Study Instance UID UI: RelStudy2
(0020, 000e) Series Instance UID UI: RelSeries2
---------
(0010, 0010) Patient's Name PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID UI: SeriesUID
(0020, 000e) Series Instance UID UI: StudyUID
The way this works is that (technically) the self.fields should always have a DicomField class that references the same element in the dicom data, that way when we edit one, the other edits as well. But this doesn't appear to be happening for some reason.
I tried removing the deepcopy of the fields (added recently) and this isn't the issue - we still don't change the first one. I then moved earlier in the code and just tried manually changing a value - this hints that perhaps the reference to the top level field is incorrect, in fact it's pointing to the one that comes much later!
field
Out[5]: (0020, 000d) Study Instance UID UI: RelStudy0 [StudyInstanceUID]
In [6]: field.element.value
Out[6]: 'RelStudy0'
In [7]: self.dicom
Out[7]:
(0008, 0012) Instance Creation Date DA: '20201001'
(0008, 0033) Content Time TM: '20201001'
(0008, 1250) Related Series Sequence 3 item(s) ----
(0020, 000d) Study Instance UID UI: RelStudy0
(0020, 000e) Series Instance UID UI: RelSeries0
---------
(0020, 000d) Study Instance UID UI: RelStudy1
(0020, 000e) Series Instance UID UI: RelSeries1
---------
(0020, 000d) Study Instance UID UI: RelStudy2
(0020, 000e) Series Instance UID UI: RelSeries2
---------
(0010, 0010) Patient's Name PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID UI: SeriesUID
(0020, 000e) Series Instance UID UI: StudyUID
In [8]: field.element.value
Out[8]: 'RelStudy0'
In [9]: field.element.value = 'ADIFFERENTONE'
In [10]: self.dicom
Out[10]:
(0008, 0012) Instance Creation Date DA: '20201001'
(0008, 0033) Content Time TM: '20201001'
(0008, 1250) Related Series Sequence 3 item(s) ----
(0020, 000d) Study Instance UID UI: ADIFFERENTONE
(0020, 000e) Series Instance UID UI: RelSeries0
---------
(0020, 000d) Study Instance UID UI: RelStudy1
(0020, 000e) Series Instance UID UI: RelSeries1
---------
(0020, 000d) Study Instance UID UI: RelStudy2
(0020, 000e) Series Instance UID UI: RelSeries2
---------
(0010, 0010) Patient's Name PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID UI: SeriesUID
(0020, 000e) Series Instance UID UI: StudyUID
In the above, since we indexed uid (0008, 0012)
we would expect the top level one to change, but that isn't the case. So then I stepped back and looked at fields, get_fields method, and it appears that the issue arises because we use dataset.iterall(), which essentially flattens the dicom fields (and gets them mixed up). Here is an example of using iterall vs. just items():
In [19]: for key, value in dataset.items():
...: print(key)
...: print(value)
...:
(0008, 0012)
(0008, 0012) Instance Creation Date DA: '20201001'
(0008, 0033)
(0008, 0033) Content Time TM: '20201001'
(0008, 1250)
(0008, 1250) Related Series Sequence SQ: <Sequence, length 3>
(0010, 0010)
(0010, 0010) Patient's Name PN: 'ReproducingTheError'
(0020, 000d)
(0020, 000d) Study Instance UID UI: SeriesUID
(0020, 000e)
(0020, 000e) Series Instance UID UI: StudyUID
In [20]: for item in dataset.iterall():
...: print(item)
...:
...:
(0008, 0012) Instance Creation Date DA: '20201001'
(0008, 0033) Content Time TM: '20201001'
(0008, 1250) Related Series Sequence SQ: <Sequence, length 3>
(0020, 000d) Study Instance UID UI: RelStudy0
(0020, 000e) Series Instance UID UI: RelSeries0
(0020, 000d) Study Instance UID UI: RelStudy1
(0020, 000e) Series Instance UID UI: RelSeries1
(0020, 000d) Study Instance UID UI: RelStudy2
(0020, 000e) Series Instance UID UI: RelSeries2
(0010, 0010) Patient's Name PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID UI: SeriesUID
(0020, 000e) Series Instance UID UI: StudyUID
So I've been interactively debugging for a few hours this morning - I know the issue is here with respect to how we iterate through the elements and save the association, but I haven't yet figured out what's wrong / how to fix it. @wetzelj do you have any ideas?
oh oops, I spoke too soon... I think I fixed it! Let me put in a PR.
I've only cursorily read this (and have to hop on to a meeting now), but wasn't the iterall introduced to enable us to act on sequences and sub-sequences as well? Does removing the iterall break anything with detecting data in those subsequences?
Here is the PR! https://github.com/pydicom/deid/pull/154
@wetzelj actually I don't think so! The reason is that we already loop through a sequence, discover that it is a sequence, and then add it's values as datasets to parse through. So we sort of unwrap the list. I think this issue likely arose because this function first just iterated through fields (and did it's own manual extraction of nested) but then you showed me the magic of iterall and I changed it, and I suspect we didn't fully test the replace again. That said, possibly now we lost something that iterall was providing? Please take a look at the PR and let me know what you think!
Sounds good. I ran some tests and it seems to be okay - I didn't notice anything being lost, just the new corrected functionality. :) I'll keep an eye out as I continue development/testing.
Fixed with #154
It seems that Deid does not manage to find and change the the Study and Series Instance UID tags that are outside the Related Series Sequence when Related Series Sequence tags are present.
I've appended a (almost) minimally working example reproducing the error I see in my data. The python code first generates a dicom file and then tries to clean it by changing
contains:StudyInstanceUID
andcontains:SeriesInstanceUID
from their original values tomodified
.A copy of the output is copied below. Ignore the weird warning formatting it is due to the windows console not having colors.
Producing code:
Deid recipe