stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Tagging values with decimal points to Money using Regex ner #507

Closed: MadhuKush closed this issue 7 years ago

MadhuKush commented 7 years ago

Hi Team,

Example sentence: DRB-Hicom Bhd and Zhejiang Geely Holdings Group have signed a definitive agreement for Geely to acquire a 49.9% stake in the national car manufacturer in a deal worth RM460.3 million.

In the sentence above, we want to tag RM460.3 million as a MONEY entity. These are the patterns we tried with the RegexNER annotator of Stanford CoreNLP:

Regex 1: RM[0-9]+.[0-9]+ million MONEY

Regex 2: RM[0-9]+(.[0-9]+) million MONEY

Regex 3: RM[0-9].*million MONEY

None of these patterns matched, so RM460.3 million is not captured as MONEY. Instead, CoreNLP tags only million as MONEY by default, resulting in NER:100000.0. We are not sure where the problem is. Could anybody please help us tag the whole of RM460.3 million as MONEY using RegexNER?

Thank you in advance.

J38 commented 7 years ago

This rule will identify it:

RM[0-9]+ .[0-9] million MONEY MISC,NUMBER 1

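The issue is that the tokenizer splits on the "." character, so the amount is spread over several tokens and each space-separated part of the rule has to match one token.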
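To make the tokenization point concrete, here is a minimal sketch of token-by-token rule matching in plain Java. It is a simplified model, not CoreNLP's actual RegexNER code: the class and method names are hypothetical, and the token split RM460 / .3 / million is an assumption inferred from the shape of the rule above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified model (assumption): a rule is a whitespace-separated sequence of
// regexes, and each regex must fully match exactly one token.
class TokenRuleSketch {

    /** Returns the index of the first token window the rule matches, or -1. */
    static int findMatch(String rule, List<String> tokens) {
        String[] parts = rule.trim().split("\\s+");
        for (int start = 0; start + parts.length <= tokens.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < parts.length; i++) {
                // Pattern.matches requires the whole token to match the regex.
                if (!Pattern.matches(parts[i], tokens.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed tokenization of "RM460.3 million" (split before the ".").
        List<String> tokens = Arrays.asList("RM460", ".3", "million");

        // Two-part rule against three tokens: the window never lines up.
        System.out.println(findMatch("RM[0-9]+.[0-9]+ million", tokens));
        // Three-part rule: one regex per token.
        System.out.println(findMatch("RM[0-9]+ .[0-9] million", tokens));
    }
}
```

Under this model the original patterns fail because they expect RM460.3 to be a single token, while a rule with one regex per token lines up with the split.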

MadhuKush commented 7 years ago

Hi Team, sorry for disturbing you again on the same issue.

The suggested rule, RM[0-9]+ .[0-9] million MONEY MISC,NUMBER 1, captures only RM460.3 as MONEY. We expect million to be captured as well, so the MONEY entity value should be RM460.3 million and not just RM460.3.

Thank you

MadhuKush commented 7 years ago

Further examples: RM460.35 million, RM460.354 million

With values like these, where there is more than one digit after the decimal point, CoreNLP identifies RM460 as Money[0] and .35 million as Money[1]. This happens with the rule RM[0-9]+ .[0-9]+ million MONEY.

Could you please look into these kinds of issues with money values? Capturing the required MONEY value has become a great challenge.

Thank you in advance.

J38 commented 7 years ago

Here is an example with the current code on GitHub.

Rule file: RM [0-9\.]+ million MONEY MISC,NUMBER 1

(note that the rule is 4 tab-separated columns; the spaces inside the first column separate the per-token patterns)
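Because tabs are easy to lose when copying a rule from a web page, one option is to generate the file programmatically. The sketch below writes the rule above with explicit tab characters. The class name is hypothetical, and the file name money-rules.txt is taken from the command in this comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical helper: builds the rule line with explicit tab separators.
class MoneyRuleFile {

    /** The four columns of the rule, joined with tab characters. */
    static String ruleLine() {
        return String.join("\t",
                "RM [0-9\\.]+ million", // column 1: one regex per token, space-separated
                "MONEY",                // column 2: entity type to assign
                "MISC,NUMBER",          // column 3: existing tags that may be overwritten
                "1");                   // column 4: priority
    }

    /** Writes the single rule line to the given file (UTF-8). */
    static Path write(Path target) throws IOException {
        return Files.write(target, Collections.singletonList(ruleLine()));
    }

    public static void main(String[] args) throws IOException {
        Path written = write(Paths.get("money-rules.txt"));
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```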

Command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping money-rules.txt -file money-example-sentence.txt -outputFormat text

Expected output:

Document: ID=money-example-sentence.txt (1 sentences, 10 tokens)
Sentence #1 (10 tokens):
They have signed a deal worth RM460.3 million.
[Text=They CharacterOffsetBegin=0 CharacterOffsetEnd=4 PartOfSpeech=PRP Lemma=they NamedEntityTag=O]
[Text=have CharacterOffsetBegin=5 CharacterOffsetEnd=9 PartOfSpeech=VBP Lemma=have NamedEntityTag=O]
[Text=signed CharacterOffsetBegin=10 CharacterOffsetEnd=16 PartOfSpeech=VBN Lemma=sign NamedEntityTag=O]
[Text=a CharacterOffsetBegin=17 CharacterOffsetEnd=18 PartOfSpeech=DT Lemma=a NamedEntityTag=O]
[Text=deal CharacterOffsetBegin=19 CharacterOffsetEnd=23 PartOfSpeech=NN Lemma=deal NamedEntityTag=O]
[Text=worth CharacterOffsetBegin=24 CharacterOffsetEnd=29 PartOfSpeech=JJ Lemma=worth NamedEntityTag=O]
[Text=RM CharacterOffsetBegin=30 CharacterOffsetEnd=32 PartOfSpeech=NN Lemma=rm NamedEntityTag=MONEY]
[Text=460.3 CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=CD Lemma=460.3 NamedEntityTag=MONEY NormalizedNamedEntityTag=4.603E8]
[Text=million CharacterOffsetBegin=38 CharacterOffsetEnd=45 PartOfSpeech=CD Lemma=million NamedEntityTag=MONEY NormalizedNamedEntityTag=4.603E8]
[Text=. CharacterOffsetBegin=45 CharacterOffsetEnd=46 PartOfSpeech=. Lemma=. NamedEntityTag=O]
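
The NormalizedNamedEntityTag value 4.603E8 in the output above is simply 460.3 million written as a Java double. The following sketch reproduces the arithmetic only; it is not CoreNLP's normalization code, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical illustration of the arithmetic behind 4.603E8.
class MoneyValueSketch {

    /** Multiplies an amount such as "460.3" by one million, using exact decimal math. */
    static double millions(String amount) {
        return new BigDecimal(amount)
                .multiply(new BigDecimal("1000000"))
                .doubleValue();
    }

    public static void main(String[] args) {
        // Java prints doubles of this magnitude in scientific notation.
        System.out.println(millions("460.3")); // prints 4.603E8
    }
}
```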