microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Analyzer identifies Portuguese phone number as US bank account #1341

Open Gasewtag opened 4 months ago

Gasewtag commented 4 months ago

Describe the bug: Analyzer identifies Portuguese phone number as US bank account

To reproduce:

  1. Execute analyzer with the following text: "my name is John Doe my phone number is +351000000000" (please replace zeros with random digits 0-9)

  2. Execute anonymizer and retrieve the following result:

text: my name is my phone number is
items: [
  {'start': 41, 'end': 57, 'entity_type': 'US_BANK_NUMBER', 'text': '', 'operator': 'replace'},
  {'start': 11, 'end': 20, 'entity_type': 'PERSON', 'text': '', 'operator': 'replace'}
]

Expected behavior: my name is my phone number is

omri374 commented 4 months ago

The default phone number recognizer supports only a subset of countries: https://github.com/microsoft/presidio/blob/db8ff8254123a113a0d511a484647734021de612/presidio-analyzer/presidio_analyzer/predefined_recognizers/phone_recognizer.py#L27

Could you please try to add Portugal (if I got the country code right) and check again? Example code:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import PhoneRecognizer

analyzer = AnalyzerEngine()

# Remove default phone recognizer
analyzer.registry.remove_recognizer("PhoneRecognizer")

# Add custom one (which supports numbers starting with +351)
pt_phone_recognizer = PhoneRecognizer(supported_regions=["PT"])
analyzer.registry.add_recognizer(pt_phone_recognizer)

analyzer.analyze("my name is John Doe my phone number is +351000000000", language="en")

# Note: the call above still does not detect a phone number, because
# +351000000000 is not a valid Portuguese phone number.
# With a valid number, it works:

analyzer.analyze(text="my name is John Doe my phone number is +351210493000", language="en", score_threshold=0.4)

Output:

[type: PERSON, start: 11, end: 19, score: 0.85,
 type: PHONE_NUMBER, start: 39, end: 52, score: 0.75]
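
To sanity-check the offsets in that output, the spans can be sliced back out of the input text. This is a minimal sketch using only the standard library: the `results` list is hand-copied from the output above rather than produced by running Presidio, and `span_text` is a hypothetical helper, not part of the Presidio API.

```python
# Input text from the second analyze() call above.
text = "my name is John Doe my phone number is +351210493000"

# Hand-copied from the output above (assumption: not produced by running
# Presidio here). Each tuple is (entity_type, start, end, score).
results = [
    ("PERSON", 11, 19, 0.85),
    ("PHONE_NUMBER", 39, 52, 0.75),
]


def span_text(text, start, end):
    """Hypothetical helper: return the substring covered by a result span.

    Offsets are treated as Python slice indices
    (start inclusive, end exclusive).
    """
    return text[start:end]


for entity_type, start, end, score in results:
    print(f"{entity_type}: {span_text(text, start, end)!r} (score {score})")
```

The PERSON span covers 'John Doe' and the PHONE_NUMBER span covers the full '+351210493000', including the leading plus sign.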