microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

presidio-structured misidentifies email as URL #1316

Open ardband opened 6 months ago

ardband commented 6 months ago

Presidio-structured incorrectly identifies an email address as a URL within the extracted entities. This can be observed in the following example output:

StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'URL', 'city': 'LOCATION', 'state': 'LOCATION'})

The value in the "email" column ("john.doe@example.com") is labeled "URL" instead of an email entity ("EMAIL_ADDRESS") during entity extraction.

miltonsim commented 6 months ago

I've also encountered this issue.

The issue mainly stems from the _find_most_common_entity() method, which picks the entity with the highest count. The email addresses in test_structured.csv are also detected as URLs, albeit with low confidence.

Observed behavior:

The emails are recognised correctly, but the URL detections outnumber them, so URL wins the count despite its lower confidence scores.
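To make the counting effect concrete, here is a minimal sketch of count-based selection. It is not the actual _find_most_common_entity() implementation; the function name `most_common_by_count`, the tuple layout, and the scores are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical detections for two cells of an "email" column,
# as (entity_type, score) pairs. The scores are illustrative only.
detections = [
    ("EMAIL_ADDRESS", 1.0), ("URL", 0.5), ("URL", 0.5),  # john.doe@example.com
    ("EMAIL_ADDRESS", 1.0), ("URL", 0.5), ("URL", 0.5),  # jane.roe@example.org
]


def most_common_by_count(results):
    """Pick the entity type that appears most often, ignoring scores."""
    counts = Counter(entity for entity, _score in results)
    return counts.most_common(1)[0][0]


print(most_common_by_count(detections))  # URL
```

With four low-confidence URL hits against two high-confidence email hits, the count alone selects URL.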

I would like to suggest two potential improvements:

  1. Adapting _find_most_common_entity() to Consider Confidence Scores: It might be beneficial to adjust the method to account for the actual confidence scores provided by the recognizer results.
  2. Enhancing the URL Recognizer: Improving the recognizer's ability to differentiate between URLs and email addresses could help reduce this type of misidentification.
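One possible shape for the second suggestion, sketched outside of Presidio: drop any URL detection whose span lies entirely inside an email detection. The `Span` type and `drop_urls_inside_emails` helper are hypothetical names, not part of the library, and the scores are assumed.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a single detection result."""
    entity_type: str
    start: int
    end: int
    score: float


def drop_urls_inside_emails(spans: List[Span]) -> List[Span]:
    """Remove URL spans that are fully contained in an EMAIL_ADDRESS span."""
    emails = [s for s in spans if s.entity_type == "EMAIL_ADDRESS"]

    def inside_email(span: Span) -> bool:
        return any(e.start <= span.start and span.end <= e.end for e in emails)

    return [
        s for s in spans
        if not (s.entity_type == "URL" and inside_email(s))
    ]


text = "john.doe@example.com"
domain_start = text.index("example.com")
spans = [
    Span("EMAIL_ADDRESS", 0, len(text), 1.0),     # the whole address
    Span("URL", domain_start, len(text), 0.5),    # the domain part only
]

filtered = drop_urls_inside_emails(spans)
print([s.entity_type for s in filtered])  # ['EMAIL_ADDRESS']
```

A standalone URL elsewhere in the same cell would survive this filter, since only URLs nested inside an email span are removed.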

I'm keen to contribute to making these improvements and would love to work on refining the logic. Any thoughts or feedback on these suggestions would be greatly appreciated!

omri374 commented 6 months ago

Thanks for the feedback! The URL recognizer also detects parts of emails (e.g. microsoft.com is a URL inside john.doe@microsoft.com), which makes it detect more URLs than emails.
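The overlap is easy to reproduce with a toy pattern. The regex below is a deliberately simplified domain matcher, not Presidio's actual URL pattern; it only illustrates how a domain-like match can fire inside an email address.

```python
import re

# Simplified, assumed pattern: a label followed by a common TLD.
# This is NOT the regex used by Presidio's URL recognizer.
DOMAIN_LIKE = re.compile(r"\b[a-z0-9-]+\.(?:com|org|net)\b")

text = "john.doe@microsoft.com"
matches = DOMAIN_LIKE.findall(text)
print(matches)  # ['microsoft.com']
```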

I think that a good way forward here would be to allow the user to decide on a strategy for selecting the entity. In some cases, we would want the entity with the majority of detections, in others we'd like the one that has the highest confidence, and in others we might want a mix of the two (e.g. most common entity, if confidence > 0.5).
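A rough sketch of what such a user-selectable strategy could look like, in plain Python. The function `select_entity`, the strategy names, and the sample scores are all assumptions for discussion, not an existing Presidio API.

```python
from collections import Counter


def select_entity(results, strategy="most_common", threshold=0.5):
    """Choose one entity type for a column from (entity_type, score) pairs.

    Strategies (hypothetical names):
      most_common        - majority vote, scores ignored
      highest_confidence - entity type of the single best-scoring detection
      mixed              - majority vote among detections scoring above threshold
    """
    if not results:
        return None
    if strategy == "highest_confidence":
        return max(results, key=lambda r: r[1])[0]
    if strategy == "most_common":
        pool = results
    elif strategy == "mixed":
        pool = [(e, s) for e, s in results if s > threshold]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if not pool:
        return None
    return Counter(e for e, _ in pool).most_common(1)[0][0]


# Illustrative detections for an "email" column; scores are made up.
results = [
    ("EMAIL_ADDRESS", 1.0), ("URL", 0.5), ("URL", 0.5),
    ("EMAIL_ADDRESS", 1.0), ("URL", 0.5), ("URL", 0.5),
]

print(select_entity(results, "most_common"))         # URL
print(select_entity(results, "highest_confidence"))  # EMAIL_ADDRESS
print(select_entity(results, "mixed", threshold=0.5))  # EMAIL_ADDRESS
```

Keeping the majority vote as the default would preserve current behaviour, while the other two strategies resolve the email column in this example.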

A quick fix could be to post-process the structured analysis once it is finalized: if the column's name is "email" but the detected entity is "URL", override it.
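As a minimal sketch of that quick fix, the snippet below operates on a plain dict standing in for the entity_mapping shown in the report. The helper name `fix_email_columns` is hypothetical, and it assumes "EMAIL_ADDRESS" is the desired target entity.

```python
def fix_email_columns(entity_mapping):
    """Return a copy where an "email" column labeled URL is relabeled.

    entity_mapping is a plain dict of column name -> entity type.
    """
    fixed = dict(entity_mapping)
    for column, entity in entity_mapping.items():
        if column.lower() == "email" and entity == "URL":
            fixed[column] = "EMAIL_ADDRESS"
    return fixed


mapping = {"name": "PERSON", "email": "URL", "city": "LOCATION", "state": "LOCATION"}
corrected = fix_email_columns(mapping)
print(corrected["email"])  # EMAIL_ADDRESS
```

This only helps when the column happens to be named "email", so it is a workaround rather than a fix for the underlying selection logic.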

If you're interested in creating a PR, I'd be happy to review it and discuss.