Closed SimonRbk95 closed 6 months ago
Hi, Presidio is structured in a way that each recognizer supports only one language. The main reason is that there are language-specific attributes (like context words). In order to have a recognizer supporting multiple languages, it needs to be created multiple times.
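To illustrate the one-language-per-recognizer point, here is a minimal sketch that expands a single recognizer definition into one copy per language. It uses plain dictionaries that mirror the YAML fields shown in this thread (name, supported_language, supported_entity, patterns, context). The helper name expand_per_language and the English context words are made up for the example, and this is not Presidio code.

```python
from copy import deepcopy

# Base definition shared by all languages (fields mirror the YAML recognizer config).
BASE_RECOGNIZER = {
    "name": "Bank Account Recognizer",
    "supported_entity": "BANK_ACCOUNT",
    "patterns": [
        {
            "name": "bank account (weak)",
            "regex": r"(?<!\d)(?:\d(?:[\\ -]{0,1}\d){8,12})(?!\d)",
            "score": 0.01,
        }
    ],
}

# Language-specific context words; the English entries are illustrative assumptions.
CONTEXT_BY_LANGUAGE = {
    "de": ["kto", "Konto"],
    "en": ["account", "acct"],
}


def expand_per_language(base, context_by_language):
    """Return one recognizer definition per language (hypothetical helper)."""
    expanded = []
    for language, context in context_by_language.items():
        definition = deepcopy(base)  # keep the base definition untouched
        definition["supported_language"] = language
        definition["context"] = list(context)
        expanded.append(definition)
    return expanded


recognizers = expand_per_language(BASE_RECOGNIZER, CONTEXT_BY_LANGUAGE)
for definition in recognizers:
    print(definition["supported_language"], definition["context"])
```

Each resulting dictionary corresponds to one recognizer entry with its own supported_language and context list, which is the shape the YAML file would need for every language served.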
But even if I use the above pipeline for only one language, where I do not supply multiple language arguments and trim down the config to one language, it works for language="en", but not "de". In fact, "de" also does not work without loading predefined recognizers. The following is one version that works for "en" but will yield a ValueError: No matching recognizers were found to serve the request for "de". Maybe I misunderstood you, but I'd appreciate any advice.
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
from presidio_analyzer.nlp_engine import NlpEngineProvider

# yaml_file and conf_file are paths defined elsewhere
# (the recognizer YAML and the NLP engine configuration).
text = "Max's Konto: 012 3456789123"

registry = RecognizerRegistry()
# registry.load_predefined_recognizers()
registry.add_recognizers_from_yaml(yaml_file)

provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

# Change the default context similarity factor
context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45, min_score_with_context_similarity=0.4
)

analyzer = AnalyzerEngine(
    registry=registry,
    supported_languages=["en"],
    nlp_engine=nlp_engine,
    context_aware_enhancer=context_aware_enhancer,
)

supported_entities = ["BANK_ACCOUNT"]
results = analyzer.analyze(text=text, language="en", entities=supported_entities)
recognizers:
  -
    name: "Bank Account Recognizer"
    patterns:
      -
        name: "bank account (weak)"
        regex: (?<!\d)(?:\d(?:[\\ -]{0,1}\d){8,12})(?!\d)
        score: 0.01
    context:
      - kto
      - Konto
    supported_entity: "BANK_ACCOUNT"
Apologies for the delayed response. The default language for each recognizer is English. Could you please try to add de as the supported_language of the recognizer in the yaml and try again? See example here:
No worries. I had already adjusted that in the meantime. But I only came back to this issue myself a few minutes ago and figured out that I also need to import the other predefined recognizers separately to adjust their language, including the SpacyRecognizer for the PERSON entity. The only thing that I am still trying to figure out is why the context enhancement does not increase the confidence score.
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

# yaml_file and conf_file are paths defined elsewhere
# (the recognizer YAML and the NLP engine configuration).
registry = RecognizerRegistry()
registry.add_recognizers_from_yaml(yaml_file)
# One SpacyRecognizer instance per language
registry.add_recognizer(SpacyRecognizer(supported_language="de"))
registry.add_recognizer(SpacyRecognizer(supported_language="en"))

provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45, min_score_with_context_similarity=0.4
)

analyzer = AnalyzerEngine(
    registry=registry,
    supported_languages=["de", "en"],
    nlp_engine=nlp_engine,
    context_aware_enhancer=context_aware_enhancer,
)

text = "Kto. Konto 012 3456789123. Max Mustermann."
results = analyzer.analyze(text=text, language="de", entities=["PERSON", "BANK_ACCOUNT"])
I haven't tested the LemmaContextAwareEnhancer on German, but there might be a difference from English, for example in how spaCy creates lemmas. If lemmas are missing, this could cause the LemmaContextAwareEnhancer to not work.
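As a rough illustration of why missing lemmas would leave the score untouched, here is a simplified sketch of a lemma-based context boost. It is not Presidio's implementation: the function name, the exact matching rule, and the example lemmas are assumptions, and only the two parameter values (0.45 and 0.4) come from the snippets in this thread.

```python
def boost_with_context(score, surrounding_lemmas, context_words,
                       similarity_factor=0.45, min_score=0.4):
    """Simplified context boost: raise the score if a context word appears nearby."""
    lemmas = {lemma.lower() for lemma in surrounding_lemmas if lemma}
    wanted = {word.lower() for word in context_words}
    if not lemmas & wanted:
        return score  # no supporting context found, score is unchanged
    # Add the factor, enforce the minimum, and cap at 1.0 (assumed rule).
    boosted = max(score + similarity_factor, min_score)
    return min(boosted, 1.0)


context_words = ["kto", "Konto"]

# Lemmas as they might come from the NLP pipeline (assumed values, not real spaCy output).
with_lemmas = boost_with_context(0.01, ["Konto", "Max"], context_words)
# If the pipeline yields no usable lemmas, nothing can match and the score stays low.
without_lemmas = boost_with_context(0.01, ["", ""], context_words)

print(with_lemmas, without_lemmas)
```

Under these assumptions, a quick check worth doing on the real pipeline is to print the lemmas the German model produces for the tokens around the match and compare them with the context words in the YAML.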
By the way, if you pass the nlp_engine and languages into the RecognizerRegistry.load_predefined_recognizers, you would not have to pass the language to the SpacyRecognizer as it would be instantiated with the language in the provided conf_file.
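The suggestion above could look roughly like the following sketch. The wrapper build_registry is a hypothetical helper introduced only for illustration; it assumes registry is a RecognizerRegistry and nlp_engine is the engine returned by NlpEngineProvider.create_engine(), and it relies on load_predefined_recognizers accepting languages and nlp_engine as keyword arguments, as the reply describes.

```python
def build_registry(registry, nlp_engine, languages, yaml_file=None):
    """Load predefined recognizers for the given languages, then custom ones.

    `registry` is expected to be a presidio_analyzer.RecognizerRegistry and
    `nlp_engine` the engine created by NlpEngineProvider; both are passed in,
    so this helper itself imports nothing.
    """
    # Per the reply above, this lets the registry set up the predefined
    # recognizers (including SpacyRecognizer) for each requested language.
    registry.load_predefined_recognizers(languages=languages, nlp_engine=nlp_engine)
    if yaml_file is not None:
        # Custom recognizers still need supported_language set in the YAML itself.
        registry.add_recognizers_from_yaml(yaml_file)
    return registry
```

With the objects from the snippets in this thread, it would be called as build_registry(RecognizerRegistry(), nlp_engine, ["de", "en"], yaml_file), replacing the two explicit SpacyRecognizer registrations.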
Thank you!
I am trying to add a simple pattern recognizer to my registry and use it alongside some of the built-in recognizers for language="de". However, I am facing a few issues:
analyzer.get_supported_entities(language="de") returns 'BANK_ACCOUNT' only. Yet, for language="en" it returns the other globally supported entities, including the custom recognizer, as expected.
en works fine if the context words are in English.
Any idea as to what I might be doing wrong? Thank you!
Code
Output:
Config
conf_file points to the following yaml file. However, other implementations with simple spaCy models (lg, md, and sm) show the same behavior.
Spacy Versions