Closed edemattos closed 2 years ago
Actually, I've realized LIKE_NUM does allow for English numbers containing "and", even when NER is disabled:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
nums = [
    "seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days",
    "eight trillion and two hundred million seventy six days",
]
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'OP': "+"}, {'LOWER': 'days'}]
matcher.add('num', [pattern], greedy="LONGEST")
for doc in nlp.pipe(nums, disable=['ner']):
    print(doc)
    matches = matcher(doc)
    for m in matches:
        print(f"### match:\n{doc[m[1]:m[2]]}")
    print()
Output:
seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days
### match:
seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days
eight trillion and two hundred million seventy six days
### match:
eight trillion and two hundred million seventy six days
but I can't replicate this behavior in Timexy 🤔
self.matcher.add(
    key,
    [
        [
            {"LIKE_NUM": True, "OP": "+"},
            {"LOWER": val.lower()},
        ]
    ],
    greedy="LONGEST",
)
Output:
### seven hundred ninety five million three hundred sixty four thousand and five hundred and fifty six days
seven hundred ninety five million CARDINAL
fifty six days timexy TIMEX3 type="DURATION" value="P56D"
### eight trillion and two hundred million seventy six weeks
eight trillion and MONEY
two hundred million seventy six weeks timexy TIMEX3 type="DURATION" value="P200000076W"
Sorry for the confusion, it was just a syntax error. The * operator needed to be separate from the LIKE_NUM attribute.
That would simplify the rule, but introduces an error with overlapping spans.
### Today is Feb 1990, six years after Feb 1984.
Today DATE
1990, six years timexy TIMEX3 type="DURATION" value="P1990, 6Y"
Feb 1984 timexy TIMEX3 type="DATE" value="1984-02-01T00:00:00"
I think it can be resolved by adding a preference for date rules before duration matches. I can try to address that if you're happy with this approach and would like to move forward. :)
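To make the idea of preferring date rules over duration matches concrete, here is a minimal sketch. It is not Timexy's code: `Candidate`, `PRIORITY` and `resolve_overlaps` are hypothetical names, the token indices are hand-labelled for the sentence above, and it assumes every candidate match is available (that is, the matcher is not already discarding shorter matches).

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int  # index of the first token in the span
    end: int    # index one past the last token
    label: str  # rule type that produced the span


# Assumption: lower number wins, so DATE rules beat DURATION rules.
PRIORITY = {"DATE": 0, "DURATION": 1}


def resolve_overlaps(candidates: List[Candidate]) -> List[Candidate]:
    """Keep non-overlapping spans, preferring rule priority, then length."""
    ordered = sorted(
        candidates,
        key=lambda c: (PRIORITY[c.label], -(c.end - c.start), c.start),
    )
    taken = set()  # token positions already claimed by a kept span
    kept = []
    for cand in ordered:
        positions = range(cand.start, cand.end)
        if any(i in taken for i in positions):
            continue  # overlaps a span that had higher preference
        taken.update(positions)
        kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


# Hand-labelled tokens (an assumption about tokenization):
# Today(0) is(1) Feb(2) 1990(3) ,(4) six(5) years(6) after(7) Feb(8) 1984(9) .(10)
candidates = [
    Candidate(2, 4, "DATE"),      # "Feb 1990"
    Candidate(3, 7, "DURATION"),  # "1990, six years" (the unwanted match)
    Candidate(5, 7, "DURATION"),  # "six years"
    Candidate(8, 10, "DATE"),     # "Feb 1984"
]
resolved = resolve_overlaps(candidates)
for span in resolved:
    print(span)
```

With this ordering, the duration that swallows "1990" is dropped because "Feb 1990" is claimed first, while the shorter "six years" candidate survives.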
Sorry again, I've now learned that this approach may not be viable. LIKE_NUM is not expected to match full number phrases like "three hundred and sixty five", because the token "and" does not have the Token attribute like_num=True. What I did above was a fluke: I didn't realize that {"OP": "+"} on its own is a wildcard that matches any token.
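For anyone following along, the difference between the two patterns can be checked with a small sketch. It assumes spaCy v3 and uses a blank English pipeline rather than en_core_web_sm, since LIKE_NUM is a lexical attribute; `longest_matches` is a helper name made up for this example.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: LIKE_NUM is a lexical attribute,
# so no trained model or NER component is involved.
nlp = spacy.blank("en")


def longest_matches(pattern, text):
    """Return the matched span texts for one pattern, using greedy="LONGEST"."""
    matcher = Matcher(nlp.vocab)
    matcher.add("num", [pattern], greedy="LONGEST")
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# A token dict holding only "OP" has no attribute constraints: it matches ANY token.
wildcard_pattern = [{"LIKE_NUM": True}, {"OP": "+"}, {"LOWER": "days"}]
# Here "+" applies to LIKE_NUM itself: one or more number-like tokens.
strict_pattern = [{"LIKE_NUM": True, "OP": "+"}, {"LOWER": "days"}]

number_text = "three hundred and sixty five days"
other_text = "two cats slept for days"

# Per-token view of the like_num flag, to see where "and" breaks the run.
like_num_flags = [(token.text, token.like_num) for token in nlp(number_text)]
print(like_num_flags)

for name, pattern in [("wildcard", wildcard_pattern), ("strict", strict_pattern)]:
    for text in (number_text, other_text):
        print(name, "|", text, "->", longest_matches(pattern, text))
```

The wildcard pattern should cover the whole number phrase only because the bare operator swallows "and" along with everything else, which is why it also fires on the second sentence. The strict pattern should stop at "and" and keep just the trailing number words, which is consistent with the Timexy output above.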
Thanks for creating this @paulrinckens!
I am interested in extending this to Portuguese, but I ran into a compatibility issue: the number 14 can be written as either "catorze" or "quatorze", but Timexy allows only one orthography because of the way num_words is indexed.
I think this can be resolved while also extending Timexy's support for durations longer than 20: by leveraging the LIKE_NUM attribute in the Matcher class, we can let spaCy do the heavy lifting.
In English, "and" may or may not appear in number words, but LIKE_NUM does not seem to capture this. I've therefore added optional modifiers (and optional subsequent number words) to allow for English numbers into the trillions.
We can also leverage the numerizer spaCy extension for converting number words into digits. Unfortunately it is only compatible with English at the moment, but it's great at what it does. Another candidate is text2num, which already supports multiple languages and seems to be under active development. Otherwise, it should also be feasible to implement a local method when adding new languages in lieu of using an external library, at least for numbers into the thousands or maybe millions. In any case, the changes made in this PR enhance support for English but don't break compatibility for German or French.
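As a rough illustration of what such a local method could look like, here is a minimal English-only sketch. It is not the code in this PR and not how numerizer or text2num work internally; `SMALL`, `SCALES`, `PT_ALIASES` and `words_to_int` are names invented for the example. Keying the lookup by word rather than by list position is what lets two spellings share one value.

```python
# Word -> value tables. Because the dict is keyed by the word itself,
# several spellings can map to the same number.
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {
    "thousand": 10**3, "million": 10**6,
    "billion": 10**9, "trillion": 10**12,
}
# Hypothetical Portuguese fragment: both orthographies of 14 coexist.
PT_ALIASES = {"catorze": 14, "quatorze": 14}


def words_to_int(text, fillers=("and",)):
    """Convert English number words to an int, skipping filler words."""
    total = 0    # groups already multiplied by their scale word
    current = 0  # value of the group being built (below 1000)
    for word in text.lower().replace("-", " ").split():
        if word in fillers:
            continue  # optional "and" carries no numeric value
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


print(words_to_int("three hundred and sixty five"))
print(words_to_int("eight trillion and two hundred million seventy six"))
print(PT_ALIASES["catorze"] == PT_ALIASES["quatorze"])
```

This sketch does no validation of word order, so it will accept some phrases that are not well-formed numbers. A per-language version would also need its own tables and filler words.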