**omri374** opened this issue 2 months ago
This looks like a bug.

To reproduce (`analyzer` is an `AnalyzerEngine` configured with a CURRENCY recognizer and context words, as in the linked discussion):

```python
# `analyzer` is an AnalyzerEngine with a CURRENCY recognizer and context
# words registered, as set up in the linked discussion (not shown here).
text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents"

res = analyzer.analyze(text=text, language="en", return_decision_process=True)
for ress in res:
    print()
    print(
        f"text: {text[ress.start:ress.end]},"
        f"\nentity: {ress.entity_type}, "
        f"\nscore before: {ress.analysis_explanation.original_score}"
        f"\nscore context improvement: {ress.analysis_explanation.score_context_improvement}"
        f"\nsupporting context word: {ress.analysis_explanation.supportive_context_word}"
    )
```
```
text: Five,
entity: CURRENCY,
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Nine,
entity: CURRENCY,
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Six,
entity: CURRENCY,
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Seven,
entity: CURRENCY,
score before: 0.01
score context improvement: 0
supporting context word:
```
Looks like this might be due to the model's part-of-speech tagging rather than a Presidio bug.

The above example uses the default spaCy NLP model `en_core_web_lg` with the text "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents":

- `Dollars` has a capital D; it is categorised as a Proper Noun and the lemma is "Dollars"
- `dollars` has a lowercase D; it is categorised as a Noun and the lemma is "dollar"
- `cents` is always categorised as a Noun with a lemma of "cent", whether or not the C is capitalised and regardless of its position in the sentence

This can be seen with the following code:
```python
import spacy

nlp = spacy.load("en_core_web_lg")

texts = [
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven Cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven Cents",
    "Will you be paying the entire balance of Sixty-Seven Cents and Five Hundred Thirty-Nine dollars",
]

for text in texts:
    print("\n", text)
    print(
        "Text:",   # Text: the original word text.
        "Lemma:",  # Lemma: the base form of the word.
        "POS:",    # POS: the simple universal part-of-speech tag.
        "Tag:",    # Tag: the detailed part-of-speech tag.
        "Alpha:",  # Alpha: is the token an alpha character?
        "Stop:",   # Stop: is the token part of a stop list, i.e. the most common words of the language?
        sep="\t",
    )
    for token in nlp(text):
        print(token.text, token.lemma_, token.pos_, token.tag_, token.is_alpha, token.is_stop, sep="\t")
```
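Per the analysis above, the rows that matter come out roughly as follows (abridged to the Text, Lemma and POS columns; exact output may vary between spaCy versions):

```
Dollars   Dollars   PROPN
dollars   dollar    NOUN
cents     cent      NOUN
Cents     cent      NOUN
```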
Interestingly, if the `en_core_web_sm` model is used, then `dollars` is always categorised as a noun. So @mmoody-vv you could look at using this model.
As the `LemmaContextAwareEnhancer` compares the context words to lemmas rather than to the actual words in the text, I think using the singular form of the words is best. I can't see this anywhere in the docs; happy to add it if this is correct, @omri374? So in this case, using "dollar" and "cent" should give you the behavior you're expecting, @mmoody-vv. A minimal sketch of that setup follows.
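For illustration, here is one way to register a CURRENCY recognizer with singular context words. The pattern name, regex, and scores are placeholders I've made up, not the recognizer from the linked discussion:

```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Hypothetical number-word pattern with a deliberately low base score, so the
# boost from context words is easy to see in the decision process.
number_words = Pattern(
    name="number_words",
    regex=r"\b(Five|Six|Seven|Nine)\b",
    score=0.01,
)

# Singular context words, so they match the lemmas "dollar" and "cent" that
# spaCy produces for lowercase "dollars"/"cents".
currency_recognizer = PatternRecognizer(
    supported_entity="CURRENCY",
    patterns=[number_words],
    context=["dollar", "cent"],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(currency_recognizer)
```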
@hhobson thanks for this analysis! I found it surprising that the lemma of `Dollars` is `Dollars`. This could be causing the issue. According to your analysis, it seems that a fix would be to lowercase the token prior to lemmatizing it, but that's not so straightforward, as spaCy runs lemmatization and NER together and we wouldn't want to pass in a lowercased sentence, since that would affect NER.
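A quick way to see the NER side effect (a sketch; the exact entities printed will vary by model version, but casing typically changes what the NER detects):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents"

# The same sentence, cased and lowercased, usually yields different entities,
# which is why lowercasing the input before the pipeline runs is risky.
print([(ent.text, ent.label_) for ent in nlp(text).ents])
print([(ent.text, ent.label_) for ent in nlp(text.lower()).ents])
```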
I agree, lowercasing the text doesn't feel like the right thing to do, especially as the different-sized spaCy models behaved differently in this case, so things might change in future versions.
I think the best approach is to recommend using singular-form context words, like `dollar` rather than `dollars`. When I tested this it produced the expected behavior of boosting the score.
Would that solve the problem if the sentence has upper-case plurals to begin with? We would end up comparing `dollar` with `Dollars`.
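A quick sketch of that mismatch, based on the lemmas reported above (tags may vary between spaCy versions):

```python
import spacy

nlp = spacy.load("en_core_web_lg")

for word in ["Dollars", "dollars"]:
    doc = nlp(f"Five Hundred Thirty-Nine {word}")
    token = doc[-1]
    # Per the analysis above, capitalised "Dollars" is tagged as a proper noun
    # and keeps the lemma "Dollars", so even a case-insensitive comparison
    # against the context word "dollar" fails; lowercase "dollars"
    # lemmatises to "dollar" and matches.
    print(token.text, token.lemma_, token.lemma_.lower() == "dollar")
```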
I'm new to Presidio (started working with the code yesterday), but I can't figure out why I'm getting the results I am; the code is below. It doesn't seem to be recognizing "cents" in the context; however, if I change it to 'cent', everything works fine. But that brings up another question: if it's basing the suffix count on "dollars", why is 'Six' (in Sixty) tagged? I assume I'm misunderstanding something. Any help would be appreciated.
Output:

```
[type: CURRENCY, start: 41, end: 45, score: 1,
 type: CURRENCY, start: 61, end: 65, score: 1,
 type: CURRENCY, start: 78, end: 81, score: 1,
 type: CURRENCY, start: 84, end: 89, score: 0.01]
```
Originally posted by @mmoody-vv in https://github.com/microsoft/presidio/discussions/1443