pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Adding start to work for extracting coordinates from header for clean #134

Closed vsoch closed 4 years ago

vsoch commented 4 years ago

Signed-off-by: vsoch vsochat@stanford.edu

Description

This is start of work to be able to say "if this field is present, extract coordinates from here, and use for the cleaning" The rendered documentation for an example / how this works is shown in this screenshot here:

image

Open questions

I need to bring in the original requestor for the addition because it's not clear that the bounding box that is represented in the data is being used as expected. The referenced code creates a mask like this:

    #create bounding box. black out pixels outside of box
    mask[:Y0, :] = 0
    mask[:, :X0] = 0
    mask[Y1:, :] = 0
    mask[:, X1:] = 0

which is to say that, for example, all pixels from 0 UNTIL the Y min and all X are set to 0, and this is different from specifying a box like [xmin, ymin, xmax, ymax]. I'm not sure this approach is using the field as expected, but I need to check with the creators of it.

vsoch commented 4 years ago

Also to note - even if we can't get the coordinates spot on, this approach would still work for detection, to be able to say "if this header is present, flag the image."

vsoch commented 4 years ago

@nquach you method on paper makes sense, but I'm having trouble with the implementation. I've put both attempts at a gist here https://gist.github.com/vsoch/0f065335f8a936ac0a03348385b9889d. Specifically for the first you can see I am trying to emulate what you did here and for the second, a simpler approach to just index the bounding boxes directly for all layers here. Both produce a malformed dicom image,

image

so I'm hoping you might be able to take a look and see something that looks fishy. Also please feel free to pull this branch and PR to it, if you didn't I would want to add you as a contributor anyway. Thanks for your help!

nquach commented 4 years ago

Which version of pydicom are you using @vsoch? I've found that if a DICOM is processed and saved using one version of pydicom and opened using another version, it can sometimes produce the weird looking frames that you are seeing. I was using pydicom 1.3.0. See if changing the version works?

vsoch commented 4 years ago

@nquach looks like I'm using pydicom==1.4.2. I'll try uninstalling and using 1.3.0 instead.

Oh gosh, this is really concerning - with 1.3.0 it works perfectly. It must be that there is some change in how the save is done, or at least I can't think of a reason beyond that. @darcymason I think I need to pull you into the discussion here - if you look at the attached clean.py script, the way it works is to create a client:

from deid.dicom import DicomCleaner                                                                                                         
client = DicomCleaner()

and then detect looks for pixels to be blacked out based on headers:

client.detect('echo1.dcm')                                                                                                                  
{'flagged': True,
 'results': [{'reason': ' SequenceOfUltrasoundRegions present ',
   'group': 'graylist',
   'coordinates': ['231,70,784,657']},
  {'reason': 'and ImageType contains RECONSTRUCTION|SECONDARY|DERIVED',
   'group': 'graylist',
   'coordinates': []},
  {'reason': 'and ImageType contains DERIVED|SECONDARY|SCREEN SAVE|VOLREN|VXTL STATE',
   'group': 'graylist',
   'coordinates': []},
  {'reason': 'and ImageType contains DERIVED|SECONDARY|SCREEN|SAVE',
   'group': 'blacklist',
   'coordinates': []}]}

And then we run clean to black out those pixels:

client.clean()                                                                                                                              
Scrubbing echo1.dcm.

and save

client.save_dicom(os.getcwd())    

The issue we are hitting is that with pydicom 1.4.2 (what I previously had) the image saved is hugely malformed (see screenshot above). However when I downgrade to 1.3.0 as @nquach suggested, it looks okay:

image

Note that I'm using the Papaya viewer for quick viewing http://ric.uthscsa.edu/mango/papaya/index.html.

Is there a consistent way to save the data that would work between both versions? Right now my only choice appears to be to pin the pydicom version to exactly 1.3.0, which I suspect many users will eventually not be happy about. Right now the requirement is just greater than or equal to 1.2.1 (or something similar to that).

darcymason commented 4 years ago

This problem sounded familiar, but I haven't been able to find a corresponding issue in the issue list.

A couple of thoughts:

If I had to guess, I'm thinking v.1.4 tried to be helpful in modifying some of the data elements related to image settings, but didn't work correctly in this case.

darcymason commented 4 years ago
  • it would be very helpful for me if you could get it down to a minimal example

Actually, it occurred to me that even just getting the same file as output in v1.4 vs 1.3 could be helpful. And rather than sharing actual files, pydicom also has a diff utility - well, had, it has now moved to an example file but is very short so could be easy to copy and use, and post the diff.

vsoch commented 4 years ago

@darcymason thank you for the speedy reply! I just tested 2.0 and the bug seems to be there, so what I'll do (possibly not today but if not, definitely this weekend) is to reproduce a very basic dummy example and share with you.

darcymason commented 4 years ago

Okay, I've played with the examples @vsoch sent me. It seems to come from https://github.com/pydicom/pydicom/pull/935. The file has Photometric Interpretation of "YBR_FULL_422", which seems to be incorrect, as pointed to by the new (since pydicom 1.4) code warning that the pixel data length is not correct.

When I added dicom.PhotometricInterpretation = "RGB" prior to first getting the pixel_array, I get binary identical files for pydicom 2.0 and 1.3. You may also have to set it back to YBR_FULL_422 before saving for the viewer program, I'm not sure.

Perhaps pydicom should revisit this situation, maybe it could be handled better, as evidenced by viewers that seem to ignore the incorrect PI, if indeed it is incorrect.

vsoch commented 4 years ago

@darcymason sorry for the delay - I meant to test this and respond this morning but I got distracted! But there's still time! So first, I can reproduce what you did - when we first load the dicom (before we grab pixel data) it starts as YBR_FULL_422

dicom.PhotometricInterpretation                                                                                                             
'YBR_FULL_422'

Then I can change the value to RGB

dicom.PhotometricInterpretation = "RGB"

And go through the whole clean() and then save_dicom(), and the issue is fixed!

image

So @darcymason my next question is about when this should happen - should we only change to RGB given a certain modality? Here is what I did (and you can look at the files here too):

            # If we have ultrasound, must be RGB to obtain pixel data
            if dicom.Modality == "US" and dicom.PhotometricInterpretation != "RGB":
                photometric_original = dicom.PhotometricInterpretation
                dicom.PhotometricInterpretation = "RGB"
                self.original = dicom.pixel_array
                dicom.PhotometricInterpretation = photometric_original
            else:
                # We will set original image to image, cleaned to clean
                self.original = dicom.pixel_array

and does a dicom file always have this attribute (if not I'll need to use a get instead).

darcymason commented 4 years ago

should we only change to RGB given a certain modality?

I think it is related to any YBR_FULL_422 where the length doesn't add up correctly -- perhaps we can ping @scaramallion, who is much more likely to know...

does a dicom file always have this attribute

Yes, it is a type 1 required element for anything with pixel data, AFAIK.

scaramallion commented 4 years ago

It difficult to say without the example dataset as there have been a couple of changes to how YBR data has been dealt with.

vsoch commented 4 years ago

@scaramallion I can forward you the instructions and dataset to reproduce. Do you have an email you can share?

scaramallion commented 4 years ago

OK, example dataset:

(0002,0010) UI =LittleEndianExplicit                    #  20, 1 TransferSyntaxUID
(0028,0002) US 3                                        #   2, 1 SamplesPerPixel
(0028,0004) CS [YBR_FULL_422]                           #  12, 1 PhotometricInterpretation
(0028,0006) US 0                                        #   2, 1 PlanarConfiguration
(0028,0008) IS [145]                                    #   4, 1 NumberOfFrames
(0028,0009) AT (0018,1063)                              #   4, 1 FrameIncrementPointer
(0028,0010) US 708                                      #   2, 1 Rows
(0028,0011) US 1016                                     #   2, 1 Columns
(0028,0014) US 0                                        #   2, 1 UltrasoundColorDataPresent
(0028,0100) US 8                                        #   2, 1 BitsAllocated
(0028,0101) US 8                                        #   2, 1 BitsStored
(0028,0102) US 7                                        #   2, 1 HighBit
(0028,0103) US 0                                        #   2, 1 PixelRepresentation
(0028,2110) CS [01]                                     #   2, 1 LossyImageCompression
(0028,2112) DS [25.6]                                   #   4, 1 LossyImageCompressionRatio
(7fe0,0010) OB 00\00\00...                              # 312907680, 1 PixelData

Couple of comments:

In pydicom v1.4 we changed the way the expected pixel data length was calculated to take into account the downsampling for YBR_FULL_422. Before v1.4 no such correction was applied so the incorrect photometric intrepretation had no effect so the correct output was returned. The numpy pixel data handler was also modified to:

So there are two options I can see:

vsoch commented 4 years ago

@nquach are you aware of any modification to the original files that might have resulted in the wrong tag? Or did they come off the scanner like that? We probably want to address a fix as @scaramallion mentioned, but I'd be interested to know if there is any insight about it being a bit off in the first place.

scaramallion commented 4 years ago

The presence of the Lossy Image Compression elements makes me think it was originally compressed, then decompressed and not all relevant element values updated.

vsoch commented 4 years ago

@scaramallion if the user does dicom.decompress() are they required to update other attributes too? It looks like it might have been done in a script here.

nquach commented 4 years ago

@scaramallion you are correct, the original DICOMs that these examples were created from were originally compressed. The files needed to be decompressed before reading pixel data. My script did not intentionally change any of the tags, so I'm not sure if the Photometric Interpretation tag was YBR_FULL_422 in the original or if it was somehow added during decompression.

scaramallion commented 4 years ago

@scaramallion if the user does dicom.decompress() are they required to update other attributes too?

Yeah they are (see the note)

My script did not intentionally change any of the tags, so I'm not sure if the Photometric Interpretation tag was YBR_FULL_422 in the original or if it was somehow added during decompression.

Dataset.decompress() doesn't change the Photometric Interpretation

vsoch commented 4 years ago

@nquach I think you would have needed to set the Photometric Interpretation then, which I definitely think is advanced usage! For your current images, if you wanted to fix them up, I think you could write a script to iterate through them and do that. @scaramallion do we know why (or if) this check could have handled the fix?

@darcymason what do you think would be the best approach moving forward?

vsoch commented 4 years ago

Should this line be returned or do we always return False for JPEG?

scaramallion commented 4 years ago

@scaramallion do we know why (or if) this check could have handled the fix?

You could have overridden it to return True, but it's really just simpler to set ds.PhotometricInterpreation = 'RGB' after doing ds.decompress(), provided it makes sense to do so.

Some of the JPEG pixel data handlers will return YBR data instead of RGB so its not always the case that it should be done. And some JPEG data won't be in the colour space given by Photometric Interpretation. There's not really a general solution that will work with everything, unfortunately.

darcymason commented 4 years ago

There's not really a general solution that will work with everything, unfortunately.

  • in pydicom, change the warning to an exception

I'm thinking a case where pixel data doesn't match the expected length should probably raise an error. And also the error message could be enhanced by noting that it appears to have been decompressed but PI no longer correct, when one can see the lossy-compression-related data elements like you mentioned in this case.

But we should probably now move this over to a pydicom issue. I'm adding one now...

vsoch commented 4 years ago

okay, for the meantime here I updated the client to do a check, issue a warning, and then try to do a fix (and provide the user with how to set an argument to skip this step).

from deid.dicom import DicomCleaner                                                                          

> client = DicomCleaner()                                                                                      
WARNING No specification, loading default base deid.dicom

> client.detect('')                                                                                            

> client.detect('echo1.dcm')                                                                                   
{'flagged': True,
 'results': [{'reason': ' SequenceOfUltrasoundRegions present ',
   'group': 'graylist',
   'coordinates': ['231,70,784,657']},
  {'reason': 'and ImageType contains RECONSTRUCTION|SECONDARY|DERIVED',
   'group': 'graylist',
   'coordinates': []},
  {'reason': 'and ImageType contains DERIVED|SECONDARY|SCREEN SAVE|VOLREN|VXTL STATE',
   'group': 'graylist',
   'coordinates': []},
  {'reason': 'and ImageType contains DERIVED|SECONDARY|SCREEN|SAVE',
   'group': 'blacklist',
   'coordinates': []}]}

> client.clean()                                                                                               
Scrubbing echo1.dcm.
WARNING Updating dicom.PhotometricInterpretation to RGB, set fix_interpretation to False to skip.

client.save_dicom()                                                                                          
# '/tmp/deid-clean-n03k9uu1/cleaned-echo1.dcm'

In the future when pydicom can raise a specific error, it will be easier to catch this case (and I'll follow the issue). I also added a section to the cleaner documentation so a user is slightly less likely to decompress without changing the PixelInterpretation.

image

vsoch commented 4 years ago

@darcymason @scaramallion does the approach seem ok?

vsoch commented 4 years ago

whoops, added a format string when I shouldn't have :)

nquach commented 4 years ago

@vsoch Tested out DicomCleaner on other ultrasound DICOM files and it seems to work fine. It looks like the function save_png and get_figure have yet to be able to handle video file data, which would be awesome to have to ensure that masking occurred correctly and all PHI is hidden. Otherwise it looks good!

vsoch commented 4 years ago

Great! Could you please open an issue that shows me what you are looking for with respect to save, ideas that you have, and we can address with another PR / feature request?

vsoch commented 4 years ago

Thank you everyone! :100: