pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Fix/infor 646 #178

Closed vsoch closed 3 years ago

vsoch commented 3 years ago

Description

Related issues: # https://github.com/pydicom/deid/issues/173

This will hopefully be a fix to the issue above! The bug is that we have two places to store the fields and values - the self.dicom and self.fields. The self.fields are operated on always to get and update values, but in this case we were updating fields but not the self.dicom, given that the tag was already there. I also ran black (and removed the version pin).

wetzelj commented 3 years ago

I see... The change looks good to me. Obviously the unit test passes now, but I also tested it with the production samples where we discovered this issue.

So really, there were two little bugs that came together here.

  1. When jittering, only the self.dicom was being updated... this was mostly asymptomatic in the past because we use self.dicom when writing the files and as a result all appeared to be okay, even though it wasn't.
  2. When replacing values, the self.fields was being updated - self.dicom was not explicitly updated.

Now, the code is performing explicit updates of the value in both locations.

If you don't mind providing a little python lesson... :) The part that I'm still unclear on - and this is due to a deficiency in my understanding of python - prior to this change, an independent replace that wasn't preceded by a jitter still functioned with the only explicit value modification being the replace of the value in self.fields. It was acting as if self.fields and self.dicom were hitting the same value in memory. The preceding jitter seemed to introduce a break between these two references - at which point the update of the value in self.fields didn't update self.dicom. Is this behavior occurring because jitter_timestamp is a standalone helper method and not a parser class method?

vsoch commented 3 years ago

I don't think I have a perfect answer, but I can say that the bug resulted from having two ways of doing this (the older one with using the dicom and then the updated one with the parser and saving fields) that (as we discussed and you pointed out) weren't consistent. The two areas that stuck out to me as causing this specific problem were here: https://github.com/pydicom/deid/blob/ad37ee4cc8339bd3bdebabca33274e4cc63fe817/deid/dicom/parser.py#L390

A call to replace actually calls add_field,

                if uid in self.fields:
                    element = self.fields[uid]

                    # Nested fields
                    while not hasattr(element, "value"):
                        element = element.element
                    element.value = value

                else:
                    element = DataElement(tag["tag"], tag["VR"], value)
                    self.dicom.add(element)
                    self.fields[uid] = DicomField(element, name, uid)

and you'll notice if the uid is already in fields, we just update the value, but we don't modify the dicom. The top loop is missing an update to self.dicom, which I added here. I think the assumption was that the element updated points to the same object in the dicom? The jitter timestamp was the opposite - we made changes to self.dicom without updating fields, here: https://github.com/pydicom/deid/blob/ad37ee4cc8339bd3bdebabca33274e4cc63fe817/deid/dicom/parser.py#L432. Again, likely the assumption being that the update to self.dicom also updates the object in self.fields (which is the same reference, at least if I remember a long time ago from testing).

I don't really have a good answer for why using JITTER seemed to break things, but I suspect there was something out of sync with respect to the two places (self.dicom and self.fields). I'd have to debug it again to check, but if I remember correctly we wound up with the result of the jitter instead of the result of the final replace.