microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Custom Pattern Recognizer Not Working Properly with German Language in Analyzer Engine #1343

Closed SimonRbk95 closed 6 months ago

SimonRbk95 commented 7 months ago

I am trying to add a simple pattern recognizer to my registry and use it alongside some of the built-in recognizers for language="de". However, I am facing a few issues:

  1. analyzer.get_supported_entities(language="de") returns 'BANK_ACCOUNT' only. Yet, for language="en" it returns the other globally supported entities, including the custom recognizer, as expected.
  2. The context words do not work with the German language setting. With language="en" they work fine, provided the context words are in English.

Any idea as to what I might be doing wrong? Thank you!

Code


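from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer, RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Define the regex pattern
regex = r"(?<!\d)(?:\d(?:[^\d]{0,3}\d){6,12})(?!\d)"  # weak regex pattern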
ba_pattern = Pattern(name="bankaccount_regex", regex=regex, score=0.01)

# Change the default context similarity factor
context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45, min_score_with_context_similarity=0.4
)

# Define the recognizer with the defined pattern and context words
ba_recognizer = PatternRecognizer(
    supported_entity="BANK_ACCOUNT", 
    patterns=[ba_pattern], 
    context=["Kto", "Konto"],
    supported_language="de"
)

# Create NLP engine based on configuration
# (conf_file is the path to the YAML file shown under "Config" below)
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

# Create registry
registry = RecognizerRegistry()

# Load built in recognizers
registry.load_predefined_recognizers()

# Add the custom recognizer
registry.add_recognizer(ba_recognizer)

analyzer = AnalyzerEngine(
    registry=registry, 
    supported_languages=["de", "en"], 
    nlp_engine=nlp_engine, 
    context_aware_enhancer=context_aware_enhancer
)

supported_entities = ['PERSON', 'IBAN_CODE', 'CREDIT_CARD', 'BANK_ACCOUNT']

text = """Das Konto von Max Müller ist 012/3456789."""

results = analyzer.analyze(text=text, language="de", entities=supported_entities)

print("Analyzer Entities 'en': ", analyzer.get_supported_entities(language="en"))
print("Analyzer Entities 'de': ", analyzer.get_supported_entities(language="de"))
print(f"Result: {results}")

Output:

Analyzer Entities 'en':  ['AU_ACN', 'DATE_TIME', 'IP_ADDRESS', 'LOCATION', 'AU_MEDICARE', 'US_SSN', 'US_DRIVER_LICENSE', 'US_ITIN', 'AU_TFN', 'EMAIL_ADDRESS', 'ORGANIZATION', 'IBAN_CODE', 'CRYPTO', 'US_BANK_NUMBER', 'US_PASSPORT', 'PERSON', 'NRP', 'MEDICAL_LICENSE', 'IN_PAN', 'AU_ABN', 'CREDIT_CARD', 'IN_AADHAAR', 'UK_NHS', 'PHONE_NUMBER', 'URL', 'SG_NRIC_FIN']
Analyzer Entities 'de':  ['BANK_ACCOUNT']
Result: [type: BANK_ACCOUNT, start: 29, end: 40, score: 0.01]
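
The reported span can be reproduced with the standard library alone. This is a minimal sketch that only exercises the regex from the snippet above, not presidio itself; the constant name BANK_ACCOUNT_REGEX is introduced here for illustration. The score of 0.01 in the result equals the pattern's base score, which suggests that the regex matched but no context boost was applied.

```python
import re

# Regex and sample text copied from the snippet above.
BANK_ACCOUNT_REGEX = r"(?<!\d)(?:\d(?:[^\d]{0,3}\d){6,12})(?!\d)"
text = "Das Konto von Max Müller ist 012/3456789."

# Find the first run of 7 to 13 digits separated by at most 3 non-digits.
match = re.search(BANK_ACCOUNT_REGEX, text)
print(match.span(), repr(match.group()))
# prints: (29, 40) '012/3456789'
```

The span (29, 40) matches the start and end values in the analyzer result, so the pattern itself behaves as intended for this input.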

Config

conf_file points to the following YAML file. However, other configurations that use plain spaCy models (lg, md, and sm) show the same behavior.

---
nlp_engine_name: transformers
models:
  -
    lang_code: de
    model_name:
      spacy: de_core_news_sm
      transformers: /bert_base_multilingual_cased_ner_hrl
  -
    lang_code: en
    model_name:
      spacy: en_core_web_sm
      transformers: /bert_base_multilingual_cased_ner_hrl
ner_model_configuration:
  labels_to_ignore:
    - O
    - I-ORG
    - B-LOC
    - I-LOC
  aggregation_strategy: simple # "simple", "first", "average", "max"
  stride: 16
  alignment_mode: strict # "strict", "contract", "expand"
  low_confidence_score_multiplier: 0.4

spaCy model versions

de-core-news-sm==3.4.0
de-core-news-lg==3.5.0
en-core-web-lg==3.5.0
en-core-web-sm==3.5.0

omri374 commented 7 months ago

Hi, Presidio is structured in a way that each recognizer supports only one language. The main reason is that there are language-specific attributes (like context words). In order to have a recognizer supporting multiple languages, it needs to be created multiple times.

SimonRbk95 commented 7 months ago

But even if I use the above pipeline only for one language, where I do not supply multiple language arguments and trim down the config to one language, it works for language="en", but not "de". In fact, "de" also does not work without loading predefined recognizers. The following is one version that works for "en" but will yield a ValueError: No matching recognizers were found to serve the request for "de". Maybe I misunderstood you, but I'd appreciate any advice.

text = "Max's Konto: 012 3456789123"
registry = RecognizerRegistry()
#registry.load_predefined_recognizers()
registry.add_recognizers_from_yaml(yaml_file)
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()
# Change the default context similarity factor
context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45, min_score_with_context_similarity=0.4
)
analyzer = AnalyzerEngine(
    registry=registry, 
    supported_languages=["en"], 
    nlp_engine=nlp_engine, 
    context_aware_enhancer=context_aware_enhancer
)
supported_entities = ['BANK_ACCOUNT']
results = analyzer.analyze(text=text, language="en", entities=supported_entities)

Recognizers YAML (yaml_file)

recognizers:
  -
    name: "Bank Account Recognizer"
    patterns:
      -
         name: "bank account (weak)"
         regex: (?<!\d)(?:\d(?:[\\ -]{0,1}\d){8,12})(?!\d)
         score: 0.01
    context:
     - kto
     - Konto
    supported_entity: "BANK_ACCOUNT"

omri374 commented 7 months ago

Apologies for the delayed response. The default language for each recognizer is English. Could you please try to add de as the supported_language of the recognizer in the yaml and try again? See example here:

https://github.com/microsoft/presidio/blob/c7fa82518d28532384560ad0270a52abcdb95ec1/presidio-analyzer/conf/example_recognizers.yaml#L16

SimonRbk95 commented 7 months ago

No worries. I had already adjusted that in the meantime. I only came back to this issue a few minutes ago and figured out that I also need to add the other predefined recognizers separately to set their language, including the SpacyRecognizer for the PERSON entity. The only thing I am still trying to figure out is why the context enhancement does not increase the confidence score.

from presidio_analyzer.predefined_recognizers import SpacyRecognizer

registry = RecognizerRegistry()
registry.add_recognizers_from_yaml(yaml_file)
registry.add_recognizer(SpacyRecognizer(supported_language="de"))
registry.add_recognizer(SpacyRecognizer(supported_language="en"))

provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45, min_score_with_context_similarity=0.4
)

analyzer = AnalyzerEngine(
    registry=registry, 
    supported_languages=["de", "en"], 
    nlp_engine=nlp_engine, 
    context_aware_enhancer=context_aware_enhancer
)

text = "Kto. Konto 012 3456789123. Max Mustermann."
results = analyzer.analyze(text=text, language="de", entities=["PERSON", "BANK_ACCOUNT"])

omri374 commented 7 months ago

I haven't tested the LemmaContextAwareEnhancer on German, but there might be a difference from English, for example in how spaCy creates lemmas. If lemmas are missing, this could cause the LemmaContextAwareEnhancer to not work.
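
One way to check this hypothesis is to look at the lemmas spaCy produces for the German text. The helper below is a small sketch, assuming spaCy is installed; token_lemmas is a hypothetical name, and it only lists lemmas rather than reproducing the enhancer's matching logic.

```python
import spacy


def token_lemmas(nlp, text):
    """Return (token text, lemma) pairs for every token in `text`."""
    return [(token.text, token.lemma_) for token in nlp(text)]


# Usage (requires the de_core_news_sm model named in the config above):
# nlp = spacy.load("de_core_news_sm")
# for token_text, lemma in token_lemmas(nlp, "Kto. Konto 012 3456789123. Max Mustermann."):
#     print(f"{token_text!r} -> {lemma!r}")
```

If the lemmas printed for "Kto." or "Konto" are empty or differ from the context words given to the recognizer, that would be a plausible reason for the missing score boost.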

By the way, if you pass the nlp_engine and languages into the RecognizerRegistry.load_predefined_recognizers, you would not have to pass the language to the SpacyRecognizer as it would be instantiated with the language in the provided conf_file.

SimonRbk95 commented 6 months ago

Thank you!