microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Anonymizer does not work #1281

Closed Siddhartha90 closed 9 months ago

Siddhartha90 commented 9 months ago

Describe the bug

The input

test data 630-596-1111

is redacted correctly and gives back "test data ". But the input

test data630-596-1111

(no space before the number) is returned unchanged: the phone number is not redacted.

To Reproduce


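import spacy
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Make sure the spaCy model is available locally
try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    # Model not found, download it
    spacy.cli.download("en_core_web_lg")
    nlp = spacy.load("en_core_web_lg")

# The failing input from the description above
text_to_anonymize = "test data630-596-1111"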

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

analyzer_results = analyzer.analyze(
    text=text_to_anonymize,
    entities=["PHONE_NUMBER", "EMAIL_ADDRESS", "URL", "LOCATION"],
    language="en",
)
scrubbed_text = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("redact", {})},
).text

print(scrubbed_text)

Expected behavior

Phone number gets scrubbed.

I'm on

presidio-analyzer==2.2.351
presidio-anonymizer==2.2.351

VMD7 commented 9 months ago

Hi @Siddhartha90, I verified your code with the versions you mentioned, and it works as expected: it successfully redacts the phone number. Please check your spaCy model. Has it been modified or retrained? You can also verify the behavior on the demo site: https://huggingface.co/spaces/presidio/presidio_demo

omri374 commented 9 months ago

@Siddhartha90 can you please provide a reproducible example? Both samples seem identical.

omri374 commented 9 months ago

Closing for now, please re-open if needed.

Siddhartha90 commented 9 months ago

@VMD7 @omri374 Apologies, I updated the description with the correct example. I was unable to reopen the issue, so I created a new one: https://github.com/microsoft/presidio/issues/1301

VMD7 commented 8 months ago

Hi @Siddhartha90, thanks for your response. The problem is not with the anonymizer or the analyzer engine; it lies in the model used to predict the entity. You can try somewhat larger models such as "stanford-deidentifier-base" or "deid_roberta_i2b2", experiment with different models, and pick the one that fits your needs. If you want to stay with a spaCy model, you can train or fine-tune it on examples like these, and it should then give the expected results.
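For anyone who wants to see how the missing space alone can change a match, here is a minimal standard-library sketch using the two inputs from the description. It does not use Presidio and makes no claim about how Presidio's recognizers or models work internally; the patterns and the `redact` helper are hypothetical and only illustrate boundary handling.

```python
import re

# Hypothetical pattern for numbers shaped like 630-596-1111.
# STRICT requires a word boundary on both sides of the number.
STRICT = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# LENIENT only forbids an adjacent digit, so letters may touch the number.
LENIENT = re.compile(r"(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)")


def redact(pattern: re.Pattern, text: str) -> str:
    """Remove every match of `pattern` from `text`."""
    return pattern.sub("", text)


samples = ["test data 630-596-1111", "test data630-596-1111"]
for sample in samples:
    print(repr(sample))
    print("  strict :", repr(redact(STRICT, sample)))
    print("  lenient:", repr(redact(LENIENT, sample)))
```

With the strict pattern the second sample comes back unchanged, because there is no word boundary between "a" and "6"; the lenient pattern removes the number in both samples. Whether a looser rule like this is acceptable depends on how many false positives your data can tolerate.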