microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio

Context words are used outside the suffix/prefix window #1444

Open omri374 opened 2 months ago

omri374 commented 2 months ago

I'm new to Presidio (started working with the code yesterday), but I can't figure out why I'm getting the results I am. Code is below. It doesn't seem to be recognizing "cents" as a context word. However, if I change it to 'cent', everything works fine. But that brings up another question: if it's basing the suffix count on "dollars", why is 'Six' (in Sixty) tagged? I assume I'm misunderstanding something. Any help would be appreciated.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents?"

regex = r"(zero|one|two|three|four|five|six|seven|eight|nine)"
currency_pattern = Pattern(name="currency_pattern (strong)", regex=regex, score=0.01)

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    context=[
        'dollars',
        'cents',
    ]
)

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=1,
    min_score_with_context_similarity=1,
    context_prefix_count=0,  # no context tokens before the match
    context_suffix_count=6,  # up to 6 context tokens after the match
)

registry = RecognizerRegistry()
registry.add_recognizer(currency_recognizer_with_context)
analyzer = AnalyzerEngine(registry=registry, context_aware_enhancer=context_aware_enhancer)

res = analyzer.analyze(text=text, language='en')
print(res)

Output: [type: CURRENCY, start: 41, end: 45, score: 1, type: CURRENCY, start: 61, end: 65, score: 1, type: CURRENCY, start: 78, end: 81, score: 1, type: CURRENCY, start: 84, end: 89, score: 0.01]

Originally posted by @mmoody-vv in https://github.com/microsoft/presidio/discussions/1443

omri374 commented 2 months ago

This looks like a bug.

To reproduce:

res = analyzer.analyze(text=text, language='en', return_decision_process=True)

for ress in res:
    print()
    print(
        f"text: {text[ress.start:ress.end]},"
        f"\nentity: {ress.entity_type}, "
        f"\nscore before: {ress.analysis_explanation.original_score}"
        f"\nscore context improvement: {ress.analysis_explanation.score_context_improvement}"
        f"\nsupporting context word: {ress.analysis_explanation.supportive_context_word}"
    )

Output:
text: Five,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Nine,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Six,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Seven,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0
supporting context word: 

hhobson commented 2 months ago

Looks like this might be due to the model's part-of-speech tagging rather than a Presidio bug.

The above example uses the default spaCy NLP model, en_core_web_lg, with the text "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents". With this model the capitalized Dollars keeps the lemma Dollars, while cents is lemmatized to cent, so the plural context word "cents" never matches a lemma.

This can be seen with the following code:

import spacy

nlp = spacy.load("en_core_web_lg")

texts = [
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven Cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven Cents",
    "Will you be paying the entire balance of Sixty-Seven Cents and Five Hundred Thirty-Nine dollars",
]

for text in texts:
    print("\n", text)

    print(
        "Text:",   # Text: the original word text
        "Lemma:",  # Lemma: the base form of the word
        "POS:",    # POS: the simple universal part-of-speech tag
        "Tag:",    # Tag: the detailed part-of-speech tag
        "Alpha:",  # Alpha: is the token an alpha character?
        "Stop:",   # Stop: is the token part of a stop list, i.e. among the most common words of the language?
        sep="\t",
    )
    for token in nlp(text):
        print(token.text, token.lemma_, token.pos_, token.tag_, token.is_alpha, token.is_stop, sep="\t")

Interestingly, if the en_core_web_sm model is used, dollars is always categorised as a noun. So @mmoody-vv you could look at using this model.
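
If it helps, this is roughly how the analyzer can be pointed at the smaller model (a sketch based on Presidio's NlpEngineProvider configuration; it reuses the registry and context_aware_enhancer from the snippet above):

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Requires: python -m spacy download en_core_web_sm
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()

# Reusing the registry and enhancer defined earlier in this thread
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    registry=registry,
    context_aware_enhancer=context_aware_enhancer,
)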

As the LemmaContextAwareEnhancer compares the context words to lemmas rather than to the actual words in the text, I think using the singular form of the words is best. I can't see this mentioned anywhere in the docs; happy to add it if this is correct, @omri374? So in this case, using "dollar" and "cent" should give you the behavior you're expecting, @mmoody-vv.
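
In other words, the only change needed to the original snippet should be the context list:

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    # Singular forms, since the enhancer matches context words against token lemmas
    context=[
        'dollar',
        'cent',
    ]
)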

omri374 commented 2 months ago

@hhobson thanks for this analysis! I found it surprising that the lemma of Dollars is Dollars. This could be causing the issue. According to your analysis, it seems a fix would be to lowercase the token prior to lemmatizing it, but that's not straightforward: spaCy runs lemmatization and NER together, and we wouldn't want to pass a lowercased sentence as it would affect NER.
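
A quick way to see the casing effect in isolation (assuming en_core_web_lg; per the analysis above, the capitalized token should come back with lemma Dollars, the lowercased one with dollar):

import spacy

nlp = spacy.load("en_core_web_lg")

# Compare the lemma of "Dollars" with and without the original capitalization
for sentence in [
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents?",
    "will you be paying the entire balance of five hundred thirty-nine dollars and sixty-seven cents?",
]:
    print([(t.text, t.lemma_) for t in nlp(sentence) if t.text.lower() == "dollars"])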

hhobson commented 2 months ago

I agree, lowercasing the text doesn't feel like the right thing to do, especially as the different-sized spaCy models behaved differently here, so things might change in future versions.

I think the best approach is to recommend using singular-form context words, like dollar rather than dollars. When I tested this, it produced the expected behavior of boosting the score.

omri374 commented 2 months ago

Would that solve the problem if the sentence has upper-case plurals to begin with? We would end up comparing dollar with Dollars.
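
To make the concern concrete, a toy comparison (the lemma value is taken from the analysis above, not produced by Presidio here):

context_word = "dollar"
token_lemma = "Dollars"  # lemma en_core_web_lg reportedly produces for the capitalized token

# Even after lowercasing, the lemma keeps its plural form and the match fails
print(context_word == token_lemma.lower())  # False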