paulrinckens / timexy

A spaCy custom component that extracts and normalizes temporal expressions
MIT License
52 stars, 8 forks

Increase support for large numbers. #5

Closed (edemattos closed 2 years ago)

edemattos commented 2 years ago

Thanks for creating this @paulrinckens!

I am interested in extending this to Portuguese, but I ran into a compatibility issue: the number 14 can be written as either "catorze" or "quatorze", but Timexy only allows one spelling per number because of the way num_words is indexed.
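
To illustrate the spelling problem, here is a minimal sketch of a lookup keyed by the word rather than by the value, so that two spellings can share one number. The name PT_NUM_WORDS and the helper word_to_int are hypothetical and are not Timexy's actual data structure or API.

```python
# Hypothetical word-to-value table (not Timexy's real num_words).
# Keying by spelling lets several spellings map to the same number.
PT_NUM_WORDS = {
    "treze": 13,
    "catorze": 14,
    "quatorze": 14,  # alternative spelling of 14
    "quinze": 15,
}


def word_to_int(word):
    """Return the integer for a number word, or None if it is unknown."""
    return PT_NUM_WORDS.get(word.lower())


print(word_to_int("catorze"), word_to_int("Quatorze"))
```

With a structure like this, adding a spelling variant is a single extra entry and does not shift any other value.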

I think this can be resolved while also extending Timexy's support to durations written with number words above 20: by leveraging the LIKE_NUM token attribute in Matcher patterns, we can let spaCy do the heavy lifting.

self.matcher.add(
    key,
    [
        [
            {"LIKE_NUM": True, "OP": "+"},
            {"LOWER": {"IN": self.timexy_lang.modifiers}, "OP": "*"},
            {"LIKE_NUM": True, "OP": "*"},
            {"LOWER": {"IN": self.timexy_lang.modifiers}, "OP": "*"},
            {"LIKE_NUM": True, "OP": "*"},
            {"LOWER": {"IN": self.timexy_lang.modifiers}, "OP": "*"},
            {"LIKE_NUM": True, "OP": "*"},
            {"LOWER": {"IN": self.timexy_lang.modifiers}, "OP": "*"},
            {"LIKE_NUM": True, "OP": "*"},
            {"LOWER": {"IN": self.timexy_lang.modifiers}, "OP": "*"},
            {"LOWER": val.lower()},
        ]
    ],
    greedy="LONGEST"
)

In English, "and" may or may not appear in number words, but LIKE_NUM does not seem to capture this. I've therefore added optional modifiers (and optional subsequent number words) to allow for English numbers into the trillions.

We can also leverage the numerizer spaCy extension for converting number words into digits. Unfortunately it is only compatible with English at the moment, but it's great at what it does. Another candidate is text2num, which already supports multiple languages and seems to be under active development. Otherwise, it should also be feasible to implement a local method when adding new languages in lieu of using an external library, at least for numbers into the thousands or maybe millions. In any case, the changes made in this PR enhance support for English but don't break compatibility for German or French.
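
As a rough idea of what a local method could look like, here is a minimal English-only sketch that converts number words into an integer without an external library. The function words_to_int and its tables are hypothetical, it simply skips "and", and it is not how numerizer or text2num work internally.

```python
# Hypothetical local converter for English number words (a sketch only).
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_int(text):
    """Convert e.g. 'three hundred and sixty five' to 365."""
    total = 0    # sum of the scale groups that are already complete
    current = 0  # value of the group currently being built (below 1000)
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue  # "and" carries no numeric value
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current = max(current, 1) * 100  # bare "hundred" means 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


print(words_to_int("three hundred and sixty five"))
```

A per-language version would need its own tables and connective words, which is where a maintained library becomes attractive.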

edemattos commented 2 years ago

Actually, I've realized LIKE_NUM does allow for English numbers containing "and", even when NER is disabled:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

nums = [
    "seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days",
    "eight trillion and two hundred million seventy six days",
]

matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'OP': "+"}, {'LOWER': 'days'}]
matcher.add('num', [pattern], greedy="LONGEST")

for doc in nlp.pipe(nums, disable=['ner']):
    print(doc)
    matches = matcher(doc)
    for m in matches:
        print(f"### match:\n{doc[m[1]:m[2]]}")
    print()

Output:

seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days
### match:
seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days

eight trillion and two hundred million seventy six days
### match:
eight trillion and two hundred million seventy six days

but I can't replicate this behavior in Timexy 🤔

self.matcher.add(
    key,
    [
        [
            {"LIKE_NUM": True, "OP": "+"}, {"LOWER": val.lower()},
        ]
    ],
    greedy="LONGEST",
)

Output:

### seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days
seven hundred ninety five million   CARDINAL    
fifty six days  timexy  TIMEX3 type="DURATION" value="P56D"

### eight trillion and two hundred million seventy six weeks
eight trillion and  MONEY   
two hundred million seventy six weeks   timexy  TIMEX3 type="DURATION" value="P200000076W"

edemattos commented 2 years ago

Sorry for the confusion, it was just a syntax error. The * operator needed to be separate from the LIKE_NUM attribute.

That would simplify the rule, but it introduces an error with overlapping spans.

### Today is Feb 1990, six years after Feb 1984.
Today   DATE    
1990, six years timexy  TIMEX3 type="DURATION" value="P1990, 6Y"
Feb 1984    timexy  TIMEX3 type="DATE" value="1984-02-01T00:00:00"

I think it can be resolved by adding a preference for date rules before duration matches. I can try to address that if you're happy with this approach and would like to move forward. :)
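
One way such a preference could work is sketched below in plain Python: candidate spans are sorted so that date matches claim their tokens first, and any later span that overlaps an already claimed token is dropped. The PRIORITY table, the resolve_overlaps helper, and the candidate list are all hypothetical; in particular, the sketch assumes the date rule proposes "Feb 1990" and the duration rule also proposes the shorter "six years".

```python
# Hypothetical priority order: lower number wins when spans overlap.
PRIORITY = {"DATE": 0, "DURATION": 1}


def resolve_overlaps(candidates):
    """Keep non-overlapping (start, end, label) spans; end is exclusive."""
    ordered = sorted(
        candidates,
        # Rule priority first, then longer spans, then leftmost start.
        key=lambda c: (PRIORITY[c[2]], -(c[1] - c[0]), c[0]),
    )
    kept, taken = [], set()
    for start, end, label in ordered:
        tokens = range(start, end)
        if taken.isdisjoint(tokens):
            kept.append((start, end, label))
            taken.update(tokens)
    return sorted(kept)


# Token indices for: Today is Feb 1990 , six years after Feb 1984 .
candidates = [
    (2, 4, "DATE"),      # Feb 1990 (assumed date candidate)
    (3, 7, "DURATION"),  # 1990, six years
    (5, 7, "DURATION"),  # six years (assumed shorter candidate)
    (8, 10, "DATE"),     # Feb 1984
]
print(resolve_overlaps(candidates))
```

Under these assumptions the overlapping "1990, six years" span is discarded and "six years" survives as the duration.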

edemattos commented 2 years ago

Sorry again, I've now learned that this approach may not be viable. LIKE_NUM is not expected to match full number phrases like "three hundred and sixty five", because "and" does not have the Token attribute like_num=True. What I did above only worked by accident: I didn't realize that {"OP": "+"} on its own is a wildcard that matches one or more tokens of any kind.
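
The difference between the two patterns can be checked with a small script. This is a sketch that assumes spaCy v3 is installed; it uses spacy.blank("en") so no trained model is needed, and the helper longest_match is hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer and lexical attributes only
doc = nlp("three hundred and sixty five days")


def longest_match(pattern):
    """Return the text of the single longest match for one pattern."""
    matcher = Matcher(nlp.vocab)
    matcher.add("num", [pattern], greedy="LONGEST")
    _, start, end = matcher(doc)[0]
    return doc[start:end].text


# {"OP": "+"} as its own token spec has no constraints: it is a wildcard.
wildcard_text = longest_match(
    [{"LIKE_NUM": True}, {"OP": "+"}, {"LOWER": "days"}]
)
# Here the operator applies to LIKE_NUM, so "and" breaks the run.
strict_text = longest_match(
    [{"LIKE_NUM": True, "OP": "+"}, {"LOWER": "days"}]
)

print([(t.text, t.like_num) for t in doc])
print(wildcard_text)
print(strict_text)
```

The wildcard version covers the whole phrase only because any token is accepted between the first number word and "days", while the strict version stops at "and" and keeps just the trailing number words.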