microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Not understanding why DICOM redaction does not detect Patient Name on example data #1309

Open parataaito opened 6 months ago

parataaito commented 6 months ago

Hello!

First, thanks for this tool, it looks very promising, so congrats on the idea!

I have a question, though. I followed the walkthrough from here, using the "0_ORIGINAL.dcm" file from the test files.

Here is my code to show it seems identical to the tutorial:

import pydicom
from presidio_image_redactor import DicomImageRedactorEngine
import matplotlib.pyplot as plt

def compare_dicom_images(
    instance_original: pydicom.dataset.FileDataset,
    instance_redacted: pydicom.dataset.FileDataset,
    figsize: tuple = (11, 11)
) -> None:
    """Display the DICOM pixel arrays of both original and redacted as images.

    Args:
        instance_original (pydicom.dataset.FileDataset): A single DICOM instance (with text PHI).
        instance_redacted (pydicom.dataset.FileDataset): A single DICOM instance (redacted PHI).
        figsize (tuple): Figure size in inches (width, height).
    """
    _, ax = plt.subplots(1, 2, figsize=figsize)
    ax[0].imshow(instance_original.pixel_array, cmap="gray")
    ax[0].set_title('Original')
    ax[1].imshow(instance_redacted.pixel_array, cmap="gray")
    ax[1].set_title('Redacted')
    plt.show()

# Set input and output paths
input_path = "0_ORIGINAL.dcm"
output_dir = "./output"

# Initialize the engine
engine = DicomImageRedactorEngine()

# Option 1: Redact from a loaded DICOM image
dicom_image = pydicom.dcmread(input_path)
redacted_dicom_image = engine.redact(dicom_image, use_metadata=True, fill="contrast")

compare_dicom_images(dicom_image, redacted_dicom_image)

However, my output is this: [screenshot]

I don't understand why the Patient Name is not redacted like it is in your example: [screenshot]

For additional info, I am using Python 3.11.2 (but I tried with 3.9 too).

PS: I did not file this as a bug since I am not 100% sure it is one. It's probably on my side, but I have no idea where it comes from...

Thanks in advance :)

parataaito commented 6 months ago

Just want to add that I also followed the example_dicom_image_redactor.ipynb notebook. Here are my results: [screenshots]

parataaito commented 5 months ago

Hello! It's been a month now with no news :'( Has anybody had the same problem and managed to solve it?

omri374 commented 5 months ago

Apologies for the delay. We will look into this soon and report back.

omri374 commented 5 months ago

@parataaito a hotfix was created and a new version released. Could you please check again? Apologies for the late resolution on this!

omri374 commented 5 months ago

Closing for now, please re-open if needed.

parataaito commented 5 months ago

Thanks for the (very) quick reply! Going to check right away!

parataaito commented 5 months ago

Works like a charm on all the demo files! So that's perfect!

I also tested it on random data I generated, and I was wondering if you understand why it does not work specifically on this one: sample_data.zip

[screenshot]

Is it due to the fact that the data I burnt into the pixel array does not match any value in the DICOM tags?

omri374 commented 4 months ago

The DICOM redactor either takes values from the tags or uses text-based approaches to identify entities such as names. In this case the default spaCy model used by Presidio is not able to detect "ez OY" as a name, but a different model can. I would suggest experimenting with Presidio's configuration. For example:

import pydicom

from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.nlp_engine import TransformersNlpEngine
from presidio_image_redactor import ImageAnalyzerEngine, DicomImagePiiVerifyEngine, DicomImageRedactorEngine
model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "StanfordAIMI/stanford-deidentifier-base",
        },
    }
]

nlp_engine = TransformersNlpEngine(models=model_config)
text_analyzer_engine = AnalyzerEngine(nlp_engine=nlp_engine)
image_analyzer_engine = ImageAnalyzerEngine(analyzer_engine=text_analyzer_engine)
dicom_engine = DicomImagePiiVerifyEngine(image_analyzer_engine=image_analyzer_engine)

instance = pydicom.dcmread(file_of_interest)  # file_of_interest: path to your DICOM file
verify_image, ocr_results, analyzer_results = dicom_engine.verify_dicom_instance(instance, padding_width=25, show_text_annotation=True)

Running this version with the spaCy model does not identify the bounding box with a name as PII, whereas this transformers model (StanfordAIMI/stanford-deidentifier-base) does. I would suggest looking further into ways to improve and customize the PII detection flows with Presidio: https://microsoft.github.io/presidio/tutorial/

jhssilva commented 3 months ago

Hi @omri374. I'm having a problem where the DICOM redaction doesn't detect the text in the header. Please refer to the following image. (I've redacted the patient data and blurred it, as this is an official image.)

[screenshot]

This is the code that I'm currently using:

import pydicom
import matplotlib.pyplot as plt
from presidio_analyzer import Pattern, PatternRecognizer
from presidio_image_redactor import DicomImageRedactorEngine

input_path = "./test"
output_dir = "./output"

engine = DicomImageRedactorEngine()

pattern_all_text = Pattern(name="any_text", regex=r"(?s).*", score=0.5)
custom_recognizer = PatternRecognizer(
    supported_entity="TEXT",
    patterns=[pattern_all_text]
)

dicom_image = pydicom.dcmread(input_path)
redacted_dicom_image = engine.redact(dicom_image, fill="background", use_metadata=False, ad_hoc_recognizers=[custom_recognizer], allow_list=[])
redacted_dicom_image.save_as(f"{output_dir}/redacted_dicom.dcm")

redact_image = pydicom.dcmread(output_dir + "/redacted_dicom.dcm")
redact_image = redact_image.pixel_array
plt.imshow(redact_image, cmap='gray')
plt.show()

It redacts all the information except the header.

omri374 commented 3 months ago

It could be an OCR issue, where the OCR just can't detect the bounding box. Have you looked into the bounding boxes returned by the OCR?
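One way to check is to dump the raw OCR output and see whether any detected word lands in the header region at all. A minimal sketch of that filtering, assuming OCR results in the pytesseract `image_to_data` dict shape that Presidio's default OCR returns (the `sample` dict and helper names here are invented for illustration):

```python
def ocr_boxes(ocr_results: dict) -> list:
    """Collect non-empty OCR detections from a pytesseract-style
    image_to_data dict (parallel lists keyed by field name)."""
    boxes = []
    for i, text in enumerate(ocr_results["text"]):
        if text.strip():  # skip blank/whitespace-only detections
            boxes.append({
                "text": text,
                "left": ocr_results["left"][i],
                "top": ocr_results["top"][i],
                "width": ocr_results["width"][i],
                "height": ocr_results["height"][i],
            })
    return boxes

def boxes_in_header(boxes: list, header_height: int) -> list:
    """Boxes whose top edge falls inside the top header strip."""
    return [b for b in boxes if b["top"] < header_height]

# Hypothetical OCR output: one word in the header, one in the image body
sample = {
    "text": ["PATIENT NAME", "", "probe"],
    "left": [10, 0, 40],
    "top": [5, 0, 200],
    "width": [120, 0, 30],
    "height": [12, 0, 10],
}
found = boxes_in_header(ocr_boxes(sample), header_height=60)
print([b["text"] for b in found])  # → ['PATIENT NAME']
```

If the list comes back empty for the header strip, the OCR itself never saw the text, and no recognizer (including a catch-all regex) can redact it.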

omri374 commented 3 months ago

Adding @niwilso and @ayabel in case they have any recommendations here, as DICOM experts.

jhssilva commented 3 months ago

Thank you for the answer @omri374.
Should I look for anything in particular in the bounding boxes?

This is the output of the simple program.

[screenshots]

I've followed the documentation. The header text doesn't seem to be covered by any of the bounding boxes.

Regarding the image: this is a DICOM ultrasound image. Even if I save it as a normal image and then use Presidio, the issue persists.

ayabel commented 3 months ago

Hi @jhssilva, it might be because the contrast between the text and the background is relatively low. In this case, you might want to consider preprocessing the image before feeding it to the redactor. Ideas for such preprocessing functions can be found in presidio_image_redactor/image_processing_engine.py. Specifically, applying the cv2.adaptiveThreshold function could help increase the contrast.
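For intuition, the core of mean-based adaptive thresholding can be sketched in plain NumPy. In practice `cv2.adaptiveThreshold` with `ADAPTIVE_THRESH_MEAN_C` is the robust choice (its border handling differs from this sketch); the point is only to show why it lifts low-contrast text: each pixel is compared to its own neighborhood mean rather than a single global cutoff.

```python
import numpy as np

def adaptive_threshold(gray: np.ndarray, block: int = 15, c: float = 10.0) -> np.ndarray:
    """Mean-based adaptive threshold sketch: a pixel becomes white (255)
    when it is brighter than its local block mean minus c, else black (0)."""
    h, w = gray.shape
    pad = block // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Integral image for fast local sums over block x block windows
    integ = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    integ = np.pad(integ, ((1, 0), (1, 0)))  # leading zero row/column
    sums = (integ[block:, block:] - integ[:-block, block:]
            - integ[block:, :-block] + integ[:-block, :-block])
    local_mean = sums / (block * block)
    return np.where(gray > local_mean - c, 255, 0).astype(np.uint8)

# A flat background stays white, while a locally dark spot is pushed to black,
# regardless of the absolute brightness level:
gray = np.full((20, 20), 100.0)
gray[10, 10] = 50
result = adaptive_threshold(gray, block=15, c=10.0)
```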

jhssilva commented 3 months ago

Hey @ayabel . Thank you for your input and guidance.

I've tested adaptiveThreshold as suggested. However, in my case it creates a problem, as I need the images to keep their original contrast (for now; this may change in the future).

That being said, I've decided to take a different approach: select the top part of the image, redact it, and then bundle the images back together. This approach seems to work. Example:

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from presidio_analyzer import Pattern, PatternRecognizer
from presidio_image_redactor import ImageRedactorEngine

redactor_image = ImageRedactorEngine()

pattern_all_text = Pattern(name="any_text", regex=r"(?s).*", score=0.5)
custom_recognizer = PatternRecognizer(
    supported_entity="TEXT",
    patterns=[pattern_all_text]
)
dicom_image = Image.open("new_image.png")

top_height = 60

# Convert the original image to a numpy array
image = np.array(dicom_image)

top_part = image[0:top_height, :]

rest_of_image = image[top_height:, :]

# Convert the top part of the image back to a PIL Image
top_part_image = Image.fromarray(top_part)

redacted_image = redactor_image.redact(top_part_image, fill="black", ad_hoc_recognizers=[custom_recognizer], allow_list=[])

# Convert the redacted PIL image back to a numpy array before stitching
final_image = np.concatenate((np.array(redacted_image), rest_of_image), axis=0)

plt.imshow(final_image)
plt.show()

Note: In this example I didn't redact the bottom part of the image.
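The crop-redact-recombine steps above can be folded into a small reusable helper that applies any redaction callable to a horizontal strip and stitches the frame back together (the names here are my own, not Presidio API; in practice `redact_fn` would wrap the engine's redact call):

```python
import numpy as np

def redact_strip(image: np.ndarray, top: int, bottom: int, redact_fn) -> np.ndarray:
    """Apply redact_fn to image[top:bottom, :] and reassemble the frame.

    redact_fn takes and returns a numpy array of the same shape.
    """
    strip = redact_fn(image[top:bottom].copy())
    return np.concatenate((image[:top], strip, image[bottom:]), axis=0)

# Example with a trivial "black out everything" redactor on the top strip:
frame = np.arange(100, dtype=np.uint8).reshape(10, 10)
out = redact_strip(frame, 0, 6, lambda a: np.zeros_like(a))
# out[:6] is all zeros; out[6:] is untouched
```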

Suggestion: it would be nice to have an example for such cases in the documentation, e.g. using the adaptive threshold or the approach I've suggested.

Image output: [screenshot]