microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

How to change the default 0.85 score for `SpacyRecognizer`? #1372

Closed by lifepillar 4 months ago

lifepillar commented 5 months ago

I have tried this with Presidio 2.2.354:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

custom_recognizer = SpacyRecognizer(ner_strength=0.25)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(custom_recognizer)

results = analyzer.analyze(
    text="Alice and Bob", language="en", return_decision_process=True, score_threshold=0.1
)

print(results)
print("------")
print([res.__dict__ for res in results])

The assigned score is always 0.85. How can I change that?

My goal is to define multiple SpacyRecognizers and control which one takes precedence. At the moment, when two entities overlap, the larger span wins, and ties between identical spans are resolved arbitrarily. Am I missing something?
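To make the desired precedence concrete, here is a minimal sketch of score-based overlap resolution done as a post-processing step. `Span` and `resolve_overlaps` are hypothetical names for illustration only and are not part of Presidio. The sketch assumes that a higher score should win and that span length is only a tie-breaker.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a recognizer result (not a Presidio class)."""

    entity_type: str
    start: int
    end: int
    score: float


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Keep the highest-scoring span among overlapping candidates.

    Ties on score are broken by span length (longer first), then by start offset.
    """
    ranked = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    kept: List[Span] = []
    for cand in ranked:
        # Half-open intervals [start, end) are disjoint when one ends
        # before (or exactly where) the other begins.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s.start)


# Offsets refer to the text "Alice and Bob"; scores and the NICKNAME
# entity type are made up for the example.
spans = [
    Span("PERSON", 0, 5, 0.25),
    Span("NICKNAME", 0, 5, 0.60),
    Span("PERSON", 10, 13, 0.25),
]
resolved = resolve_overlaps(spans)
print(resolved)
```

With distinct scores per recognizer, the identical spans over "Alice" no longer tie: the candidate with the higher score is kept.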

omri374 commented 5 months ago

Hi, please see the following code snippet:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine, NerModelConfiguration

# Define which model to use
model_config = [{"lang_code": "en", "model_name": "en_core_web_lg"}]

# Set the score assigned to entities detected by the NER model
ner_model_configuration = NerModelConfiguration(default_score=0.6)

# Create the NLP Engine based on this configuration
spacy_nlp_engine = SpacyNlpEngine(
    models=model_config, ner_model_configuration=ner_model_configuration
)

analyzer = AnalyzerEngine(nlp_engine=spacy_nlp_engine)
analyzer.analyze(...)

Using the `NerModelConfiguration` class, you can further configure which entities the model returns, how they map to Presidio's entities, and more.

https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/#how-ner-results-flow-within-presidio
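As a rough illustration of that further configuration, the fragment below sets the default score, a list of model labels to drop, and a label-to-entity mapping. The specific labels and mapping values are assumptions chosen for the example, not recommended settings, and the field names (`labels_to_ignore`, `model_to_presidio_entity_mapping`) should be checked against the installed Presidio version. It also assumes the `en_core_web_lg` spaCy model is installed.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, SpacyNlpEngine

# Assumed example values: adjust labels and mapping to your own model.
ner_model_configuration = NerModelConfiguration(
    default_score=0.6,  # score given to NER-detected entities
    labels_to_ignore=["CARDINAL", "ORDINAL"],  # model labels to discard
    model_to_presidio_entity_mapping={
        "PERSON": "PERSON",
        "PER": "PERSON",
        "GPE": "LOCATION",
        "LOC": "LOCATION",
    },
)

spacy_nlp_engine = SpacyNlpEngine(
    models=[{"lang_code": "en", "model_name": "en_core_web_lg"}],
    ner_model_configuration=ner_model_configuration,
)

analyzer = AnalyzerEngine(nlp_engine=spacy_nlp_engine)
results = analyzer.analyze(text="Alice and Bob", language="en")

# Inspect the entity type, offsets, and score of each result.
for res in results:
    print(res.entity_type, res.start, res.end, res.score)
```

Because the score is set on the NLP engine rather than on the recognizer, this approach changes it for every entity produced by that engine's NER model.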