microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Anonymizing two intersecting entities where the one with the larger span has the lower score. #1156

Open OnsElleuch opened 1 year ago

OnsElleuch commented 1 year ago

Is your feature request related to a problem? Please describe. I'm always frustrated when two entities intersect like this: they start at the same position, one ends after the other, and the entity with the larger span has the lower score. Anonymization then replaces the larger entity.

from presidio_analyzer import RecognizerResult
from presidio_anonymizer import AnonymizerEngine

text = 'Name: Word1 Word2 Word3'
engine = AnonymizerEngine()
# Both entities start at position 6; ENTITY1 is longer but has the lower score.
analyzer_results = [
    RecognizerResult("ENTITY1", start=6, end=17, score=0.1),
    RecognizerResult("ENTITY2", start=6, end=11, score=1),
]
result = engine.anonymize(
    text=text, analyzer_results=analyzer_results
)

print("De-identified text")
print(result.text)

The result is: Name: <Entity1> Word3

Describe the solution you'd like: I expected: Name: <Entity2> <Entity1> Word3

Describe alternatives you've considered: I could skip the anonymizer and implement the replacement myself, but it makes more sense to have this done in the library.

omri374 commented 1 year ago

Hi @OnsElleuch, the logic we have for conflict resolution definitely has some assumptions which might not hold for everyone. Because the analyzer and anonymizer are two distinct modules, have you considered doing some conflict resolution in between? You could come up with a different approach, and provide non-overlapping entities to the anonymizer. Would that be a viable option?
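A minimal sketch of what such an in-between conflict-resolution step could look like, for illustration only. It is not Presidio's own logic: `Span`, `resolve_overlaps`, and `replace_spans` are hypothetical names, and `Span` is a stand-in that only mirrors the four fields used in the example above (entity type, start, end, score). The assumed policy is that the higher-scoring entity keeps its full range and the lower-scoring one is trimmed to whatever text is left over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a recognizer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


def resolve_overlaps(text, spans):
    """Return non-overlapping spans; higher-scoring spans keep their full range."""
    claimed = []  # spans accepted so far, never overlapping each other
    # Highest score first; ties go to the longer span.
    for span in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        pieces = [(span.start, span.end)]
        for kept in claimed:
            next_pieces = []
            for start, end in pieces:
                if kept.end <= start or kept.start >= end:
                    next_pieces.append((start, end))  # no overlap with kept
                    continue
                if start < kept.start:
                    next_pieces.append((start, kept.start))  # left remainder
                if kept.end < end:
                    next_pieces.append((kept.end, end))  # right remainder
            pieces = next_pieces
        for start, end in pieces:
            # Trim whitespace so a remainder does not swallow separators.
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            if start < end:
                claimed.append(Span(span.entity_type, start, end, span.score))
    return sorted(claimed, key=lambda s: s.start)


def replace_spans(text, spans):
    """Replace each span with <ENTITY_TYPE>, working right to left."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "Name: Word1 Word2 Word3"
spans = [
    Span("ENTITY1", start=6, end=17, score=0.1),
    Span("ENTITY2", start=6, end=11, score=1.0),
]
resolved = resolve_overlaps(text, spans)
print(replace_spans(text, resolved))  # Name: <ENTITY2> <ENTITY1> Word3
```

In a real pipeline, the resolved spans would presumably be converted back into `RecognizerResult` objects and passed to `AnonymizerEngine.anonymize` instead of the toy `replace_spans` helper. Other policies (for example, always preferring the longer span) would fit the same structure by changing the sort key.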

OnsElleuch commented 1 year ago

@omri374 I can of course try to resolve the conflicts on my own. What confuses me is that this looks like a partial intersection but is not treated as one. Or is this a different case?