microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.88k stars 578 forks

DICOM verify engine: remove duplicates by score, all PHIs are PERSONs #1037

Open SharonHart opened 1 year ago

SharonHart commented 1 year ago

A few bugs in the DICOM verify engine are now causing the tests in test_dicom_image_pii_verify_engine_integration.py to fail:

  1. When we remove duplicates, we take the first element regardless of its score (code pointer). After fixing this to keep the higher score, it took a PERSON entity with value '16' and score 1.0 over a real PERSON entity from spaCy with score 0.85.
  2. How was '16' identified as a PERSON? This is another bug: we treat the DICOM metadata as PHI and add each element to a deny list with PERSON as the entity.
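To illustrate both points, here is a minimal sketch of score-aware duplicate removal. It is not the actual Presidio code: the `Detection` class, the `key` field used to decide what counts as a duplicate, and the `source` labels are all hypothetical. It shows that keeping the higher score fixes the first bug, but also that a deny-list hit with score 1.0 will then always win over a spaCy result with score 0.85.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for one detected entity (not a Presidio class)."""
    entity_type: str
    text: str
    key: str        # hypothetical duplicate key, e.g. the matched image region
    score: float
    source: str     # hypothetical label for which recognizer produced the hit


def dedupe_keep_first(detections):
    """Current behaviour as described: the first element wins, score ignored."""
    seen = {}
    for det in detections:
        seen.setdefault(det.key, det)
    return list(seen.values())


def dedupe_keep_highest(detections):
    """Proposed behaviour: among duplicates, keep the highest-scoring one."""
    best = {}
    for det in detections:
        current = best.get(det.key)
        if current is None or det.score > current.score:
            best[det.key] = det
    return list(best.values())


detections = [
    Detection("PERSON", "John Doe", key="region-1", score=0.85, source="spacy"),
    Detection("PERSON", "16", key="region-1", score=1.0, source="deny_list"),
]

first = dedupe_keep_first(detections)
highest = dedupe_keep_highest(detections)
print(first[0].source, highest[0].source)
```

With first-wins the result depends on input order; with highest-wins the deny-list entry always overrides the spaCy entity, which is why the second bug has to be fixed as well.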

But why is it failing now? Probably the latest version of spaCy started finding more PERSON entities, which are sometimes overridden and sometimes not when duplicates are removed.

@omri374 @niwilso

Tests were skipped in https://github.com/microsoft/presidio/pull/1032

omri374 commented 1 year ago

@SharonHart can this be closed or not yet?

SharonHart commented 1 year ago

> @SharonHart can this be closed or not yet?

We are still tagging DICOM metadata as PERSON.
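One possible direction, sketched under assumptions rather than taken from the Presidio code base: build the PERSON deny list only from metadata elements whose DICOM value representation (VR) is `PN` (Person Name), instead of from every element. The function name `build_person_deny_list` and the `(keyword, vr, value)` tuple layout are hypothetical, and the sample values are made up for illustration.

```python
def build_person_deny_list(elements):
    """Collect name parts from Person Name (VR == "PN") elements only.

    `elements` is a hypothetical iterable of (keyword, vr, value) tuples;
    real code would read these from the parsed DICOM dataset.
    """
    deny_list = []
    for keyword, vr, value in elements:
        if vr != "PN" or not value:
            continue  # skip non-name metadata such as numeric identifiers
        # DICOM separates the components of a person name with "^".
        for part in str(value).replace("^", " ").split():
            if part not in deny_list:
                deny_list.append(part)
    return deny_list


# Made-up sample metadata: one name element and one numeric element.
sample_elements = [
    ("PatientName", "PN", "Doe^John"),
    ("SeriesNumber", "IS", "16"),
]

person_deny_list = build_person_deny_list(sample_elements)
print(person_deny_list)
```

In this sketch a numeric value such as '16' never reaches the PERSON deny list, so it cannot be reported as a PERSON with score 1.0. Other PHI-bearing elements would need their own entity types rather than PERSON.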