johnperry / CTP

Clinical Trial Processor
http://mircwiki.rsna.org/index.php?title=CTP_Articles

Logic for Dicom Burned Pixels #16

Closed vsoch closed 7 years ago

vsoch commented 7 years ago

hey @johnperry ! I'm implementing a simple de-identification tool in Python, and was reading about CTP in this paper, along with the different criteria outlined. From what I can tell, most places don't have a clear programmatic way to determine if there are burned-in pixels - the guidance is vague: some images carry the Burned In Annotation header and others do not, and even the chapter only says that CT images normally do not have burned-in annotations while ultrasound images often do. Anyway - a good first idea, I thought, would be to mimic what you are doing in CTP, and I wanted to check with you about what this translates to:

![0008,0008].contains("SAVE") * 
![0008,103e].contains("SAVE") * 
[0018,1012].equals("") *
[0018,1016].equals("") *
[0018,1018].equals("") *
[0018,1019].equals("") *
![0028,0301].contains("YES")

I translate this to mean:

has_annotation is False given:

the image was not saved with some other software

    # ![0008,0008].contains("SAVE") *
    # ImageType is multi-valued, so coerce to str for a substring test
    if "SAVE" in str(dicom.get('ImageType', '')):
        has_annotation = True

    # ![0008,103e].contains("SAVE") *
    if "SAVE" in str(dicom.get('SeriesDescription', '')):
        has_annotation = True

There are no flags to indicate secondary capture

    # [0018,1012].equals("") *
    # truthiness check, so a present-but-empty element still counts as clean
    if dicom.get('DateOfSecondaryCapture'):
        has_annotation = True

    # [0018,1016].equals("") *
    if dicom.get('SecondaryCaptureDeviceManufacturer'):
        has_annotation = True

    # [0018,1018].equals("") *
    if dicom.get('SecondaryCaptureDeviceManufacturerModelName'):
        has_annotation = True

    # [0018,1019].equals("") *
    if dicom.get('SecondaryCaptureDeviceSoftwareVersions'):
        has_annotation = True

and the image is not flagged to have a Burned Annotation

    # ![0028,0301].contains("YES")
    if dicom.get('BurnedInAnnotation','no').upper() == "YES":
        has_annotation = True
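
For reference, the checks above can be folded into one predicate. This is only a sketch, not CTP's implementation: a plain dict stands in for the pydicom Dataset (with pydicom, `dataset.get(keyword)` behaves similarly for these elements), and `str()` coercion approximates CTP's substring-style `contains`:

```python
# Sketch of the CTP burned-in-annotation filter as a single predicate.
# Element keywords correspond to the tags in the CTP script above.

SECONDARY_CAPTURE_FIELDS = [
    "DateOfSecondaryCapture",                       # (0018,1012)
    "SecondaryCaptureDeviceManufacturer",           # (0018,1016)
    "SecondaryCaptureDeviceManufacturerModelName",  # (0018,1018)
    "SecondaryCaptureDeviceSoftwareVersions",       # (0018,1019)
]

def may_have_burned_in_annotation(header):
    """Return True unless the header passes every CTP check."""
    # ![0008,0008].contains("SAVE") and ![0008,103e].contains("SAVE")
    if "SAVE" in str(header.get("ImageType", "")):
        return True
    if "SAVE" in str(header.get("SeriesDescription", "")):
        return True
    # [0018,1012..1019].equals(""): any non-empty secondary-capture
    # element fails the filter
    if any(header.get(field) for field in SECONDARY_CAPTURE_FIELDS):
        return True
    # ![0028,0301].contains("YES")
    if "YES" in str(header.get("BurnedInAnnotation", "")).upper():
        return True
    return False
```
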

Is this correct? I was wondering how the CTP algorithm goes about it - does it use this filter first, and if an image passes, run it through DicomPixelAnonymizer.script to find the exact locations to blank? With the above, on a small test set it found images that were saved with software, but a few images that had L/R direction markers and a date burned in were flagged as clean. My intuition is that there are other checks we should be doing with regard to Modality (or some other fields?), or that all images should be checked regardless. I'm working on a container that uses text detection that we could run over flagged images, but it would be better if we could deduce this entirely from the header data. Thanks in advance for your advice!

johnperry commented 7 years ago

The script language is described in:

 http://mircwiki.rsna.org/index.php?title=The_CTP_DICOM_Pixel_Anonymizer 

and

 http://mircwiki.rsna.org/index.php?title=The_CTP_DICOM_Filter

The former article explains signatures and regions.

The latter article explains the operators and methods available in defining signatures.

JP

vsoch commented 7 years ago

Thanks for the details! I implemented an extension to the pydicom Dataset to have similar functions, and also represented the different criteria (but in JSON); the whole thing runs in this function, which is called via the command-line client. I'm new to DICOM, so please excuse my ignorance - but is this typically a robust set of criteria for finding PHI? I just ran the functions across several datasets at Stanford, and they didn't detect much of anything. Also, what is the reason that many of the numeric fields (e.g., Rows) are matched with containsIgnoreCase instead of equals? It may be pydicom that converts them to numeric, but it would make sense for rows/columns to be a number, or at most a number in a list. Thanks for your wisdom!

johnperry commented 7 years ago

The DicomPixelAnonymizer doesn't find PHI; it just blanks regions of pixels that have been identified by users to contain PHI. The script file contains the signatures that identify specific image types and the regions of those image types to blank. Over the years, users have contributed updates, and I have included them in CTP releases. I haven't edited their contributions.
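
For readers unfamiliar with the mechanics, region blanking amounts to overwriting a rectangle of the pixel array. A minimal sketch follows, using a nested list in place of real PixelData (with pydicom one would modify the decoded pixel array and rewrite the PixelData element); the (left, top, width, height) region format follows the DicomPixelAnonymizer script described in the wiki article linked above:

```python
def blank_region(pixels, left, top, width, height, value=0):
    """Overwrite a rectangular region of a 2-D pixel array in place.

    `pixels` is a list of rows; the region is given as
    (left, top, width, height), as in a DicomPixelAnonymizer script.
    """
    for row in pixels[top:top + height]:
        row[left:left + width] = [value] * width
    return pixels
```
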

If the signatures don't catch Stanford images containing burned-in PHI, then I assume that nobody has ever contributed signatures and regions for those image types. The most prolific contributors have been people at the University of Michigan.

You are right to suggest that the containsIgnoreCase method doesn't make sense for numeric elements.
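
To illustrate why (a sketch; `contains_ignore_case` here is a stand-in for CTP's method, not its actual code): pydicom decodes Rows (0028,0010) as an integer, and a substring test on its string form can match values it shouldn't, whereas an equality test cannot:

```python
def contains_ignore_case(value, fragment):
    """Stand-in for a containsIgnoreCase-style test: case-insensitive
    substring match on the string form of the value."""
    return fragment.lower() in str(value).lower()

rows = 512  # what pydicom would return for ds.Rows (VR US)

# The substring test matches more than intended...
assert contains_ignore_case(rows, "51")  # "51" is a substring of "512"
# ...while an exact numeric comparison does not.
assert rows != 51
```
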

I went through the script, fixed a lot of things, and pushed an update to GitHub:

https://github.com/johnperry/CTP/blob/master/source/files/scripts/DicomServiceAnonymizer.script

A complete de-identification requires (in CTP terms) both the DicomAnonymizer and the DicomPixelAnonymizer.

JP

vsoch commented 7 years ago

Ah, thank you @johnperry ! Your details are very helpful, and I will definitely be looking through the fields to mimic best practice (right now we blank most things). Finding the PHI in the pixels does seem to be the challenging part - I started on an OCR Docker image to find regions with image processing / machine learning (and it needs a lot of work!), but hopefully we will end up with a combined method: filter the set using header fields, then run character recognition and black out regions within the image. Closing the issue - thanks again!