Open howff opened 1 year ago
That sounds great! Let me know what I can do to support you for that.
We (working on Microsoft Presidio) currently have this capability in beta: https://microsoft.github.io/presidio/image-redactor/ We'd be happy to collaborate on this.
Thanks very much for your contribution @omri374 ! I see that it's using Tesseract for OCR and spaCy for NER/PII. In my experience Tesseract is dreadful at OCR in the real world (I'm testing on all radiology images for a whole country): it needs too much pre-processing and still gives a very poor result. And in my experience spaCy is very unreliable at NER in this context (it's OK for sentences, sometimes, but useless for the text fragments that OCR finds in radiology images). I'm happy to hear that you seem to have had better success though.
Hi @howff, Presidio is very customizable, and allows you to plug in multiple tools. Currently, we are using Tesseract, but we are working on a next version which would allow you to plug in any OCR engine easily: https://github.com/microsoft/presidio/discussions/1049
As this is still in design, we'd be very happy to get your feedback on this based on your experience with DICOM de-identification and are open to contributions of all sorts.
For NER, we support multiple NLP tools such as Hugging Face and Flair as well. In our demo, you can experiment with two BERT-based approaches and a Flair approach: https://huggingface.co/spaces/presidio/presidio_demo
I agree that NER wouldn't necessarily be accurate on OCR output, so we use hints from the DICOM metadata, and can customize the detection of PHI using other approaches such as rule-based patterns and deny-lists.
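To make the metadata-hint idea concrete, here is a rough sketch of how a deny-list built from DICOM metadata could be combined with a rule-based pattern to flag OCR fragments. It is not Presidio's implementation: the function names, the plain-dict metadata, and the 8-digit date rule are all assumptions made for illustration.

```python
import re


def build_deny_list(metadata, min_len=3):
    """Collect lower-cased tokens from metadata values (hypothetical helper).

    `metadata` is assumed to be a plain dict such as
    {"PatientName": "DOE^JANE"}; it is not a pydicom Dataset.
    """
    tokens = set()
    for value in metadata.values():
        # DICOM person names separate components with '^', e.g. "DOE^JANE".
        for tok in re.split(r"[\^\s,]+", str(value)):
            if len(tok) >= min_len:
                tokens.add(tok.lower())
    return tokens


# Illustrative rule: an 8-digit date such as 19800131 (an assumed pattern).
DATE_PATTERN = re.compile(r"\b(19|20)\d{6}\b")


def is_phi(fragment, deny_list):
    """Return True if an OCR text fragment matches the deny-list or the rule."""
    words = re.findall(r"[A-Za-z0-9]+", fragment.lower())
    if any(word in deny_list for word in words):
        return True
    return bool(DATE_PATTERN.search(fragment))


metadata = {"PatientName": "DOE^JANE", "PatientID": "A12345"}
deny_list = build_deny_list(metadata)
flagged = [f for f in ["Jane D.", "LEFT LATERAL", "DOB 19800131"]
           if is_phi(f, deny_list)]
```

Exact token matching like this would miss OCR misreads (for example "J4NE"), so a real pipeline would probably need fuzzy matching on top.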
cc @niwilso
That's exactly the same approach I've taken here (see ocrengine.py and nerengine.py) https://github.com/SMI/dicompixelanon
@howff this looks great! @vsoch and @howff, if you'd like to collaborate on this, and see how we can integrate all of this into a Presidio+PyDICOM tool, we would be happy to work on this together.
Yeah! I’m happy to help however I can.
I notice you have a request for anonymising pixel data using OCR. I have been working on this, but in a separate code base, not as modifications to deid. It turns out that the hardest part is the evaluation, not the actual OCR. What I can report right now is that easyocr (a Python library) gives really excellent results. There are still a few things to watch out for, but I think it would be quite easy to integrate easyocr into deid.
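As a rough idea of what the integration could look like, the sketch below turns easyocr's output into axis-aligned rectangles that a redaction step could blank out. It assumes easyocr's default `readtext` output of `(bbox, text, confidence)` tuples; the helper names and the 0.3 confidence threshold are hypothetical, and nothing here is taken from deid or dicompixelanon.

```python
def bbox_to_rect(bbox):
    """Convert a 4-point quadrilateral [[x, y], ...] to (left, top, right, bottom)."""
    xs = [int(point[0]) for point in bbox]
    ys = [int(point[1]) for point in bbox]
    return min(xs), min(ys), max(xs), max(ys)


def text_rectangles(ocr_results, min_confidence=0.3):
    """Keep rectangles for non-empty detections above an assumed threshold."""
    rects = []
    for bbox, text, confidence in ocr_results:
        if text.strip() and confidence >= min_confidence:
            rects.append(bbox_to_rect(bbox))
    return rects


def find_text_rectangles(image_path, languages=("en",)):
    """Run easyocr on an image file and return candidate redaction rectangles."""
    # Imported lazily: easyocr is a third-party package and fetches
    # its models the first time a Reader is created.
    import easyocr

    reader = easyocr.Reader(list(languages))
    return text_rectangles(reader.readtext(image_path))


# Example with hand-written results in the same shape readtext returns.
sample_results = [
    ([[10, 20], [110, 20], [110, 40], [10, 40]], "DOE JANE", 0.92),
    ([[0, 0], [5, 0], [5, 5], [0, 5]], "x", 0.10),
]
sample_rects = text_rectangles(sample_results)
```

Keeping the OCR call separate from the rectangle logic means the filtering can be evaluated on saved OCR output without rerunning the model, which matters if evaluation is the hard part.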