microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

word boundaries #254

Closed Cherchercher closed 4 years ago

Cherchercher commented 4 years ago

I have a lot of custom PII entities that I want to add. I keep running into cases where numbers are matched by several recognizers, resulting in overlapping classifications.

For example, 1234567890 9844412312312323 at https://presidio-demo.azurewebsites.net results in UMBER>.

I would like to understand how this scenario occurs so that I can resolve it properly.

omri374 commented 4 years ago

Hi @Cherchercher. Overlaps make sense as we are running multiple recognizers independently.
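To illustrate how independent recognizers end up overlapping on the input from the question, here is a minimal sketch. It does not use Presidio's API: the `Span` class, the two regex patterns, and the scores are all hypothetical stand-ins for real recognizers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity: str
    start: int
    end: int  # exclusive
    score: float


# Hypothetical patterns and scores, for illustration only.
# Neither pattern uses word boundaries, so a 10-digit match
# can land inside a longer run of digits.
PATTERNS = {
    "PHONE_NUMBER": (re.compile(r"\d{10}"), 0.4),
    "CREDIT_CARD": (re.compile(r"\d{16}"), 0.6),
}


def run_recognizers(text):
    """Run every pattern on its own; no recognizer knows about the others."""
    spans = []
    for entity, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(Span(entity, match.start(), match.end(), score))
    return spans


text = "1234567890 9844412312312323"
spans = run_recognizers(text)
for span in spans:
    print(span, repr(text[span.start:span.end]))
```

In this sketch the first ten digits of the 16-digit number are also reported as a phone number, so two spans start at the same offset. Anchoring the shorter pattern with word boundaries (`\b\d{10}\b`) would prevent that particular match, but it would not help when two patterns legitimately match the same text.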

Some things to consider in this case:

  1. Use a threshold to ignore identified PII with a low score
  2. Come up with logic to handle these overlaps. For example, this code (in our research repo) handles overlaps in a naive way: https://github.com/microsoft/presidio-research/blob/65f4239cd41360b362252fbd0557231a23e52fc3/presidio_evaluator/span_to_tag.py#L60 It looks at each token's prediction score to come up with a non-overlapping representation. The unit tests may give a better sense of how this works: https://github.com/microsoft/presidio-research/blob/65f4239cd41360b362252fbd0557231a23e52fc3/tests/test_span_to_tag.py#L160
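Both suggestions can be sketched in application code roughly as follows. This is an assumption-laden example, not the linked `span_to_tag` implementation: it works on whole spans rather than per-token scores, and the `Span` class, the entity names, the scores, and the 0.3 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity: str
    start: int
    end: int  # exclusive
    score: float


def filter_by_score(spans, threshold):
    """Step 1: drop results whose score is below the threshold."""
    return [s for s in spans if s.score >= threshold]


def resolve_overlaps(spans):
    """Step 2: greedy resolution. Highest score wins; ties go to the longer span."""
    kept = []
    ranked = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    for cand in ranked:
        # Keep the candidate only if it is disjoint from everything kept so far.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


# Hypothetical results for "1234567890 9844412312312323"
results = [
    Span("PHONE_NUMBER", 0, 10, 0.4),
    Span("CUSTOM_ID", 0, 10, 0.05),  # made-up low-confidence match
    Span("PHONE_NUMBER", 11, 21, 0.4),
    Span("CREDIT_CARD", 11, 27, 0.6),
]

resolved = resolve_overlaps(filter_by_score(results, threshold=0.3))
for span in resolved:
    print(span)
```

With these made-up scores, the low-confidence match is filtered out and the phone-number span inside the 16-digit number loses to the higher-scoring credit-card span. A different application might prefer the longest span, or a fixed entity priority, over the score.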

Neither the demo nor Presidio itself handles overlaps, since such logic should be owned by the application rather than the engine, to keep the engine as generic as possible.