Closed by MadhuKush 7 years ago
This rule will identify it:
RM[0-9]+ .[0-9] million MONEY MISC,NUMBER 1
The issue is that the tokenizer splits on the "." character.
Hi Team, sorry to bother you again about the same issue.
The suggested rule, RM[0-9]+ .[0-9] million MONEY MISC,NUMBER 1, captures only RM460.3 as Money. The expected value should include "million" as well, so the Money entity should be RM460.3 million, not just RM460.3.
Thank you
Further examples: RM460.35 million, RM460.354 million
With values like these, where more than one digit follows the decimal point, NLP identifies RM460 as Money[0] and .35 million as Money[1]. This happens with the regex RM[0-9]+ .[0-9]+ million MONEY
Could you please address these kinds of issues with Money values? Capturing the required Money value has become a great challenge.
Thank you in advance.
Here is an example with the current code on GitHub.
Rule file:
RM [0-9\.]+ million MONEY MISC,NUMBER 1
(Note that the rule consists of four tab-separated columns.)
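Since the tabs do not survive in the rendered rule above, here is a minimal Python sketch of one way to generate the rule line with real tab characters. The helper name format_rule and the scratch directory are illustrative assumptions, not part of CoreNLP, and the column meanings in the comments (token patterns, tag, overwritable tags, priority) are my reading of the rule, worth checking against the RegexNER documentation.

```python
import tempfile
from pathlib import Path

def format_rule(pattern, tag, overwritable, priority):
    """Join the four columns of one rule with literal tab characters.

    Assumed column meanings: token patterns, NER tag to assign,
    tags that may be overwritten, and rule priority.
    """
    return "\t".join([pattern, tag, overwritable, str(priority)])

# The rule from this thread. Inside column 1 the per-token patterns are
# separated by spaces; the columns themselves are separated by tabs.
rule = format_rule(r"RM [0-9\.]+ million", "MONEY", "MISC,NUMBER", 1)

# Written to a scratch directory here; in practice the target would be
# the money-rules.txt file named in the command.
with tempfile.TemporaryDirectory() as scratch:
    path = Path(scratch) / "money-rules.txt"
    path.write_text(rule + "\n", encoding="utf-8")
    # Read the line back and split on tabs to confirm the column count.
    fields = path.read_text(encoding="utf-8").rstrip("\n").split("\t")

print(len(fields), fields)
```

If the printed count is not 4, the file most likely contains spaces where tabs were intended.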
Command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping money-rules.txt -file money-example-sentence.txt -outputFormat text
Expected output:
Document: ID=money-example-sentence.txt (1 sentences, 10 tokens)
Sentence #1 (10 tokens):
They have signed a deal worth RM460.3 million.
[Text=They CharacterOffsetBegin=0 CharacterOffsetEnd=4 PartOfSpeech=PRP Lemma=they NamedEntityTag=O]
[Text=have CharacterOffsetBegin=5 CharacterOffsetEnd=9 PartOfSpeech=VBP Lemma=have NamedEntityTag=O]
[Text=signed CharacterOffsetBegin=10 CharacterOffsetEnd=16 PartOfSpeech=VBN Lemma=sign NamedEntityTag=O]
[Text=a CharacterOffsetBegin=17 CharacterOffsetEnd=18 PartOfSpeech=DT Lemma=a NamedEntityTag=O]
[Text=deal CharacterOffsetBegin=19 CharacterOffsetEnd=23 PartOfSpeech=NN Lemma=deal NamedEntityTag=O]
[Text=worth CharacterOffsetBegin=24 CharacterOffsetEnd=29 PartOfSpeech=JJ Lemma=worth NamedEntityTag=O]
[Text=RM CharacterOffsetBegin=30 CharacterOffsetEnd=32 PartOfSpeech=NN Lemma=rm NamedEntityTag=MONEY]
[Text=460.3 CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=CD Lemma=460.3 NamedEntityTag=MONEY NormalizedNamedEntityTag=4.603E8]
[Text=million CharacterOffsetBegin=38 CharacterOffsetEnd=45 PartOfSpeech=CD Lemma=million NamedEntityTag=MONEY NormalizedNamedEntityTag=4.603E8]
[Text=. CharacterOffsetBegin=45 CharacterOffsetEnd=46 PartOfSpeech=. Lemma=. NamedEntityTag=O]
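The output above shows RM, 460.3 and million as three separate tokens. The following Python sketch illustrates, under a simplifying assumption, why the rule with three space-separated patterns matches while the single-pattern attempts from the original question do not. It assumes each space-separated pattern must match exactly one whole token. rule_matches is a hypothetical helper, not CoreNLP code, and it uses Python's re module rather than Java's regex engine.

```python
import re

def rule_matches(rule_pattern, tokens):
    """Return the first run of tokens matched by a space-separated rule, or None.

    Assumption: each space-separated pattern must match one whole token.
    """
    parts = rule_pattern.split(" ")
    for start in range(len(tokens) - len(parts) + 1):
        window = tokens[start:start + len(parts)]
        if all(re.fullmatch(p, t) for p, t in zip(parts, window)):
            return window
    return None

# Tokens as they appear in the output shown in this thread.
tokens = ["worth", "RM", "460.3", "million", "."]

# Three patterns, one per token: matches.
print(rule_matches(r"RM [0-9\.]+ million", tokens))        # ['RM', '460.3', 'million']

# Attempts that expect "RM460.3" to be a single token: no match.
print(rule_matches(r"RM[0-9]+.[0-9]+ million", tokens))    # None
print(rule_matches(r"RM[0-9]+(.[0-9]+) million", tokens))  # None
print(rule_matches(r"RM[0-9].*million", tokens))           # None
```

The design point is that the pattern has to follow the tokenizer's boundaries, so checking the token list in the pipeline output first is a quick way to see how many patterns a rule needs.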
Hi Team,
Example Sentence : DRB-Hicom Bhd and Zhejiang Geely Holdings Group have signed a definitive agreement for Geely to acquire a 49.9% stake in the national car manufacturer in a deal worth RM460.3 million.
In the above sentence we want to tag RM460.3 million as a money entity. Below are the regexes we used in the RegexNER of Stanford NLP:
Regex 1: RM[0-9]+.[0-9]+ million MONEY
Regex 2: RM[0-9]+(.[0-9]+) million MONEY
Regex 3: RM[0-9].*million MONEY
None of the above regexes matched, and RM460.3 million is not captured as Money. Instead, NLP identifies only million as money by default, resulting in NER:100000.0. We are not sure where the issue is. Could anybody please help us tag the whole of RM460.3 million as money using the RegexNER of Stanford NLP?
Thank you in advance.