pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License

Anonymise DICOM pixel data using OCR #252

Open howff opened 1 year ago

howff commented 1 year ago

I notice you have a request for anonymising pixel data using OCR. I have been working on this, but in a separate code base, not as modifications to deid. It turns out that the hardest part is the evaluation, not the actual OCR. What I can report right now is that easyocr (a Python library) gives really excellent results. There are still a few things to watch out for, but I think it would be quite easy to integrate easyocr into deid.
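To make the idea concrete, here is a minimal sketch of blanking OCR-detected text regions in a pixel grid. It is not code from deid or from the separate code base mentioned above. It assumes OCR hits in the shape that easyocr's `Reader.readtext()` returns (four corner points, text, confidence); the helper names `easyocr_hits` and `redact_hits` and the `min_confidence` threshold are hypothetical. It blanks every confident hit and makes no attempt to decide whether the text is actually PHI.

```python
from typing import List, Sequence, Tuple

# One OCR hit in the shape easyocr's Reader.readtext() returns:
# (four [x, y] corner points, recognised text, confidence).
OcrHit = Tuple[Sequence[Sequence[float]], str, float]


def easyocr_hits(image) -> List[OcrHit]:
    """Run easyocr on an image array (imported lazily; needs easyocr installed)."""
    import easyocr

    reader = easyocr.Reader(["en"])
    return reader.readtext(image)


def redact_hits(pixels, hits, min_confidence=0.3, fill=0):
    """Return a copy of a 2D pixel grid with each confident OCR box blanked.

    `min_confidence` and `fill` are illustrative defaults, not tuned values.
    """
    out = [list(row) for row in pixels]
    height = len(out)
    width = len(out[0]) if out else 0
    for corners, _text, confidence in hits:
        if confidence < min_confidence:
            continue  # skip weak detections
        xs = [int(p[0]) for p in corners]
        ys = [int(p[1]) for p in corners]
        # Axis-aligned bounding rectangle, clamped to the image.
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                out[y][x] = fill
    return out
```

With a real image, the hits would come from something like `easyocr_hits(ds.pixel_array)` after reading the file with `pydicom.dcmread`, and the same rectangles would be blanked in the array before writing it back.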

vsoch commented 1 year ago

That sounds great! Let me know what I can do to support you for that.

omri374 commented 1 year ago

We (working on Microsoft Presidio) currently have this capability in beta: https://microsoft.github.io/presidio/image-redactor/ We'd be happy to collaborate on this.

howff commented 1 year ago

Thanks very much for your contribution @omri374! I see that it's using Tesseract for OCR and spaCy for NER/PII. In my experience Tesseract is dreadful at OCR in the real world (I'm testing on all radiology images for a whole country): it needs too much pre-processing and then gives a very poor result. And in my experience spaCy is very unreliable at NER in this context (it's OK for sentences, sometimes, but useless for text fragments found by OCR in radiology images). I'm happy to hear that you seem to have had better success though.

omri374 commented 1 year ago

Hi @howff, Presidio is very customizable, and allows you to plug in multiple tools. Currently, we are using Tesseract, but we are working on the next version, which would allow you to plug in any OCR engine easily: https://github.com/microsoft/presidio/discussions/1049

As this is still in design, we'd be very happy to get your feedback on this based on your experience with DICOM de-identification and are open to contributions of all sorts.

For NER, we support multiple NLP tools such as Hugging Face and Flair as well. In our demo, you can experiment with two BERT-based approaches and a Flair-based approach: https://huggingface.co/spaces/presidio/presidio_demo

I agree that no NER model is guaranteed to be accurate on OCR output, so we use hints from the DICOM metadata, and can customize the detection of PHI using other approaches such as rule-based patterns and deny-lists.
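As a rough illustration of the metadata-hint idea (not Presidio's actual implementation), the sketch below builds a deny-list from selected metadata values and combines it with a couple of regular expressions to flag OCR text fragments. The function names, the chosen patterns, and the plain-dict metadata are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical rule-based patterns; a real deployment would tune these.
PATTERNS = [
    re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b"),  # dates such as 03/07/1961
    re.compile(r"\b\d{8}\b"),  # DICOM-style dates such as 19610703
]


def build_deny_list(metadata: Dict[str, str], keys: Iterable[str]) -> List[str]:
    """Collect lower-cased tokens from the selected metadata values."""
    tokens = set()
    for key in keys:
        value = metadata.get(key, "")
        # DICOM person names separate components with '^'.
        for token in re.split(r"[\^\s,]+", value):
            if len(token) >= 2:
                tokens.add(token.lower())
    return sorted(tokens)


def is_phi(fragment: str, deny_list: Iterable[str]) -> bool:
    """Flag a fragment that contains a deny-list token or matches a pattern."""
    lowered = fragment.lower()
    if any(token in lowered for token in deny_list):
        return True
    return any(pattern.search(fragment) for pattern in PATTERNS)


metadata = {"PatientName": "DOE^JOHN", "PatientID": "A12345"}
deny_list = build_deny_list(metadata, ["PatientName", "PatientID"])
flagged = [f for f in ["Doe, John", "LEFT LATERAL", "03/07/1961"] if is_phi(f, deny_list)]
```

Substring matching is deliberately crude here: it catches OCR fragments that run words together, at the cost of over-flagging when a short name happens to appear inside an unrelated word.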

omri374 commented 1 year ago

cc @niwilso

howff commented 1 year ago

That's exactly the same approach I've taken here (see ocrengine.py and nerengine.py): https://github.com/SMI/dicompixelanon

omri374 commented 1 year ago

@howff this looks great! @vsoch and @howff, if you'd like to collaborate on this, and see how we can integrate all of this into a Presidio+PyDICOM tool, we would be happy to work on this together.

vsoch commented 1 year ago

Yeah! I’m happy to help however I can.