RegexNER overwrites CoreNLP NER tags #983

victoriastuart commented 4 years ago

Update / Solution: it was a $CLASSPATH issue. Users reading this can skip most of the content below; here are the key comments.

I'm going to echo @DaveQuinn29 's concern [#910] that there is a cache and/or some other issue in CoreNLP. Issues involved include:

I had been trying to add custom NER tagging via a custom RegexNER file, as I did with the Java version a couple of years ago (more recently in Python via stanfordnlp). I can't get RegexNER to work in Python, so I returned to the Java implementation of CoreNLP -- from the command line -- to troubleshoot (and work from there, if needed).

java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators 'tokenize,ssplit,pos,lemma,regexner' \
-regexner.mapping custom_entities.tsv \
-file input_sentence.txt \
-outputFormat text

However, once again I have not had much success in NER tagging with CoreNLP's trained models plus my custom RegexNER TSV file, formatted as described at

Victoria Stuart PERSON      2
p53|p53-mediated    PRGE        2
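
One thing worth ruling out early is whitespace: RegexNER mapping files are tab-separated, and tabs are easily turned into spaces when a file is copied from a web page (the listing above may itself have lost its tabs in rendering). The following is a minimal sketch, assuming the column layout is pattern, NER class, optional overwritable types, optional priority; `check_regexner_tsv` is a hypothetical helper, not part of CoreNLP.

```shell
# Sketch: sanity-check a RegexNER mapping file before passing it to CoreNLP.
# Assumption: fields are TAB-separated and each rule has 2 to 4 of them
# (pattern, NER class, optional overwritable types, optional priority).
# check_regexner_tsv is a hypothetical helper name, not a CoreNLP tool.
check_regexner_tsv() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    /^$/ { next }   # blank lines are simply skipped by this check
    NF < 2 || NF > 4 {
      printf "line %d: %d tab-separated field(s)\n", NR, NF
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

Running `check_regexner_tsv custom_entities.tsv` prints the offending line numbers and returns a non-zero status if any rule does not have 2 to 4 tab-separated fields.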

I've tried various permutations of `tokenize, ssplit, pos, lemma, ner, regexner` (always in that relative order) with `-regexner.mapping` or `-ner.additional.regexner.mapping`, and I cannot simultaneously NER-tag text with the default CoreNLP libraries plus my own custom NER tags.

While I can get either NER (the default statistical model + ... fine NER rules added via the regexner annotator) or -regexner.mapping (with my custom tokens file) to work, it's always either one or the other. And before I'm directed there (cough: @J38), I've certainly looked at the page to which we are so often referred.

Furthermore, in evaluating various permutations of annotators, I've found that when annotators are added stepwise, the lemma annotator is particularly troublesome, immediately breaking RegexNER. And when I try to step back, old annotator settings appear to be retained (cached?). For example, I get lemmatization, a dependency parse, etc. in the output even if those annotators are not included in the annotators argument list.

It appears that whenever CoreNLP encounters an error, it silently loads the defaults, so that user-defined settings are ignored.

Similar issues / concerns have been raised elsewhere.

If I slowly add annotators one at a time, I can (sort of: not consistently) "reset" CoreNLP:

java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner' \
  -regexner.mapping custom_entities.tsv -file input_sentence.txt -outputFormat text; echo; cat input_sentence.txt.out; echo

  Adding annotator tokenize
  No tokenizer type provided. Defaulting to PTBTokenizer.
  Adding annotator ssplit
  Adding annotator pos
  Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
  Adding annotator lemma
  Adding annotator regexner
  TokensRegexNERAnnotator regexner: Read 13 unique entries out of 13 from custom_entities.tsv, 0 TokensRegex patterns.

  Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentence.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentence.txt.out
  Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentence.txt ... done [0.2 sec].

  Annotation pipeline timing information:
  TokenizerAnnotator: 0.1 sec.
  WordsToSentencesAnnotator: 0.0 sec.
  POSTaggerAnnotator: 0.0 sec.
  MorphaAnnotator: 0.0 sec.
  TokensRegexNERAnnotator: 0.0 sec.
  TOTAL: 0.2 sec. for 9 tokens at 56.3 tokens/sec.
  Pipeline setup: 0.6 sec.
  Total time for StanfordCoreNLP pipeline: 0.8 sec.

  Document: ID=input_sentence.txt (1 sentences, 9 tokens)
  Sentence #1 (9 tokens):
  Victoria Stuart lives in Vancouver, British Columbia.
  [Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
  [Text=Stuart CharacterOffsetBegin=9 CharacterOffsetEnd=15 PartOfSpeech=NNP Lemma=Stuart NamedEntityTag=PERSON]
  [Text=lives CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=VBZ Lemma=live]
  [Text=in CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=IN Lemma=in]
  [Text=Vancouver CharacterOffsetBegin=25 CharacterOffsetEnd=34 PartOfSpeech=NNP Lemma=Vancouver]
  [Text=, CharacterOffsetBegin=34 CharacterOffsetEnd=35 PartOfSpeech=, Lemma=,]
  [Text=British CharacterOffsetBegin=36 CharacterOffsetEnd=43 PartOfSpeech=NNP Lemma=British]
  [Text=Columbia CharacterOffsetBegin=44 CharacterOffsetEnd=52 PartOfSpeech=NNP Lemma=Columbia]
  [Text=. CharacterOffsetBegin=52 CharacterOffsetEnd=53 PartOfSpeech=. Lemma=.]

Adding the ner annotator breaks RegexNER tagging (compare with the output above):

java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' \
-regexner.mapping custom_entities.tsv -file input_sentence.txt -outputFormat text; echo; cat input_sentence.txt.out; echo

  Adding annotator tokenize
  No tokenizer type provided. Defaulting to PTBTokenizer.
  Adding annotator ssplit
  Adding annotator pos
  Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
  Adding annotator lemma
  Adding annotator ner
  Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.9 sec].
  Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
  Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [1.3 sec].
  Adding annotator regexner
  TokensRegexNERAnnotator regexner: Read 13 unique entries out of 13 from custom_entities.tsv, 0 TokensRegex patterns.

  Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentence.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentence.txt.out
  Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentence.txt ... done [0.1 sec].

  Annotation pipeline timing information:
  TokenizerAnnotator: 0.0 sec.
  WordsToSentencesAnnotator: 0.0 sec.
  POSTaggerAnnotator: 0.0 sec.
  MorphaAnnotator: 0.0 sec.
  NERCombinerAnnotator: 0.0 sec.
  TokensRegexNERAnnotator: 0.0 sec.
  TOTAL: 0.1 sec. for 9 tokens at 80.4 tokens/sec.
  Pipeline setup: 4.3 sec.
  Total time for StanfordCoreNLP pipeline: 4.5 sec.

  Document: ID=input_sentence.txt (1 sentences, 9 tokens)
  Sentence #1 (9 tokens):
  Victoria Stuart lives in Vancouver, British Columbia.
  [Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=ORGANIZATION]
  [Text=Stuart CharacterOffsetBegin=9 CharacterOffsetEnd=15 PartOfSpeech=NNP Lemma=Stuart NamedEntityTag=ORGANIZATION]
  [Text=lives CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
  [Text=in CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
  [Text=Vancouver CharacterOffsetBegin=25 CharacterOffsetEnd=34 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
  [Text=, CharacterOffsetBegin=34 CharacterOffsetEnd=35 PartOfSpeech=, Lemma=, NamedEntityTag=O]
  [Text=British CharacterOffsetBegin=36 CharacterOffsetEnd=43 PartOfSpeech=NNP Lemma=British NamedEntityTag=LOCATION]
  [Text=Columbia CharacterOffsetBegin=44 CharacterOffsetEnd=52 PartOfSpeech=NNP Lemma=Columbia NamedEntityTag=LOCATION]
  [Text=. CharacterOffsetBegin=52 CharacterOffsetEnd=53 PartOfSpeech=. Lemma=. NamedEntityTag=O]

J38 commented 4 years ago

Just to clarify, does this example not work for you? It's key to not include regexner in any way. The ner annotator should be running the entire named entity recognition process, and having the extra regexner could definitely interfere.

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping example-rule.txt -file rule-sentences.txt -outputFormat text

When I run that example I see my rules and statistical model blended together.
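
The advice above (use `ner` with `-ner.additional.regexner.mapping`, and leave `regexner` out of the annotator list) can be checked mechanically before launching a long run. This is only a sketch: `warn_if_ner_and_regexner` is a hypothetical helper, and the rule it encodes is just the recommendation given in this thread.

```shell
# Sketch: flag an annotator list that requests both `ner` and `regexner`.
# warn_if_ner_and_regexner is a hypothetical helper, not part of CoreNLP.
warn_if_ner_and_regexner() {
  # Strip spaces and pad with commas so names match exactly
  # (",ner," does not match inside ",regexner,").
  list=",$(printf '%s' "$1" | tr -d ' '),"
  case $list in
    *,ner,*) ;;
    *) return 0 ;;
  esac
  case $list in
    *,regexner,*)
      echo "both 'ner' and 'regexner' requested; consider 'ner' with -ner.additional.regexner.mapping" >&2
      return 1
      ;;
  esac
  return 0
}
```

For example, `warn_if_ner_and_regexner "tokenize,ssplit,pos,lemma,ner,regexner"` prints the warning and returns 1, while a list containing only one of the two returns 0.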

victoriastuart commented 4 years ago

@J38 : hello; thank you for your reply. Yes, that is correct: when I run that exact pipeline (your suggestion, above and Example 2, below),

`-annotators tokenize, ssplit, pos, lemma, ner  -ner.additional.regexner.mapping`

I do not get the blended output.

The default CoreNLP tagging -- which tags Victoria (me), Vancouver (city) and Canada (country) as LOCATION, and tags apples and bananas as O (OTHER) -- is shown for reference in Example 1.

I only get RegexNER tagging when I include regexner as an annotator (see, e.g., Example 3),

`-annotators tokenize, ssplit, pos, lemma, ner, regexner  -regexner.mapping`

or

`-annotators tokenize, ssplit, pos, lemma, regexner  -regexner.mapping`

and in those instances there is no blended tagging.

[Suggestion: if you are working from your own machine, where you develop / code CoreNLP packages, please go to a fresh machine and git clone the repo, to make sure that you are running the same code as those of us who get the code that way.]


$ uname -a
Linux victoria 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux

$ which java

$ java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment (build 13.0.1+9)
OpenJDK 64-Bit Server VM (build 13.0.1+9, mixed mode)

$ echo $JAVA_HOME


$ cat input_sentences.txt 
Victoria lives in Vancouver, Canada. She likes apples and bananas.

$ cat custom_entities.tsv
apple   FRUIT       2
banana  FRUIT       2

Example 1:

$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out

Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.5 sec].

Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.1 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.0 sec.
NERCombinerAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 13 tokens at 91.5 tokens/sec.
Pipeline setup: 2.7 sec.
Total time for StanfordCoreNLP pipeline: 2.9 sec.

Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Example 2:

$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out

Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.5 sec].

Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.1 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.0 sec.
NERCombinerAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 13 tokens at 100.0 tokens/sec.
Pipeline setup: 2.7 sec.
Total time for StanfordCoreNLP pipeline: 2.8 sec.

Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Example 3:

$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,regexner \
-regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out

Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator regexner
TokensRegexNERAnnotator regexner: Read 5 unique entries out of 5 from custom_entities.tsv, 0 TokensRegex patterns.

Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.2 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
TokensRegexNERAnnotator: 0.0 sec.
TOTAL: 0.2 sec. for 13 tokens at 83.3 tokens/sec.
Pipeline setup: 0.8 sec.
Total time for StanfordCoreNLP pipeline: 1.0 sec.

Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=CITY]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=,]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=COUNTRY]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=.]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=.]

J38 commented 4 years ago

What are the contents of the directory where you are running this command?

J38 commented 4 years ago

Looking over your output, it seems like you're running an older Stanford CoreNLP, because it doesn't appear to be running the fine-grained stuff by default when the ner annotator is specified.

For instance when I run using Stanford CoreNLP 3.9.2 I see this output

$ ~/stanford-corenlp/working_dirs/ner$ echo $CLASSPATH
$ ~/stanford-corenlp/working_dirs/ner$ java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping victoria-rules.txt -file victoria-example.txt -outputFormat text
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.8 sec].
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.8 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.additional.regexner: Read 1 unique entries out of 1 from victoria-rules.txt, 0 TokensRegex patterns.

Processing file ~/stanford-corenlp/working_dirs/ner/victoria-example.txt ... writing to ~/stanford-corenlp/working_dirs/ner/victoria-example.txt.out
Annotating file ~/stanford-corenlp/working_dirs/ner/victoria-example.txt ... done [4.7 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 4.5 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 0.1 sec.
TOTAL: 4.7 sec. for 7 tokens at 1.5 tokens/sec.
Pipeline setup: 17.3 sec.
Total time for StanfordCoreNLP pipeline: 22.2 sec.

Also in your Example 2 everything is tagged LOCATION, which indicates the fine-grained NER did not run at all.

J38 commented 4 years ago

But when I look at your output, I'm not seeing either ner.fine.regexner or ner.additional.regexner running.
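
The same comparison can be made with `grep` on a saved log. This is a rough sketch: the two marker strings are the ones visible in the 3.9.2 log quoted in this thread, other versions may word them differently, and `check_ner_log` is a hypothetical helper name.

```shell
# Sketch: report whether the fine-grained and additional RegexNER stages
# show up in a saved CoreNLP log. The markers are taken from the 3.9.2 log
# lines quoted in this thread; check_ner_log is a hypothetical helper.
check_ner_log() {
  rc=0
  for marker in 'ner.fine.regexner' 'ner.additional.regexner'; do
    if grep -q -F -- "$marker" "$1"; then
      echo "found:   $marker"
    else
      echo "missing: $marker"
      rc=1
    fi
  done
  return "$rc"
}
```

One way to use it is to rerun the pipeline with `> run.log 2>&1` appended to the java command and then call `check_ner_log run.log`; a "missing" line suggests the jars being loaded are not the ones expected.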

J38 commented 4 years ago

So if you're running this in a directory with older Stanford CoreNLP code, the -cp "*" will cause it to use whatever jars are in the directory where you run the command. The CORENLP_HOME variable is used by the Python code; the Java code ignores it.
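
To see what `-cp "*"` would actually pick up, it can help to list the jars in the working directory first. Java expands a classpath wildcard to the files ending in `.jar` or `.JAR` in that directory; the sketch below mirrors that rule. `list_wildcard_jars` is a hypothetical helper name.

```shell
# Sketch: list the jar files that `java -cp "*"` would load from a directory
# (default: the current one). list_wildcard_jars is a hypothetical helper.
list_wildcard_jars() {
  dir=${1:-.}
  for jar in "$dir"/*.jar "$dir"/*.JAR; do
    # An unmatched glob stays literal, so skip names that do not exist.
    [ -e "$jar" ] && printf '%s\n' "${jar##*/}"
  done
  return 0
}
```

If the listing shows an unexpected `stanford-corenlp-*.jar` or an older models jar, that is the code the command will run, regardless of `CORENLP_HOME`.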

victoriastuart commented 4 years ago

Hi: sorry: I should have mentioned my classpath. I have two CoreNLP installations:

Here are the details.

[victoria@victoria apps]$ cd CoreNLP
[victoria@victoria CoreNLP]$ cd target

[victoria@victoria target]$ echo $CORENLP_HOME/

[victoria@victoria target]$ java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out

[victoria@victoria target]$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out

Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.4 sec].

Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.2 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 0.0 sec.
TOTAL: 0.2 sec. for 13 tokens at 67.7 tokens/sec.
Pipeline setup: 2.7 sec.
Total time for StanfordCoreNLP pipeline: 2.9 sec.

Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]

victoriastuart commented 4 years ago


OK, per @J38 's kind comments, this is solved! :-D

$  ## blank


I appended the following to my $CLASSPATH.

$ export CLASSPATH="$CLASSPATH:/mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-kbp-corenlp-models-current.jar";

$ for file in `find /mnt/Vancouver/apps/CoreNLP/lib/ -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done
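
The `for file in` loop over `find` output splits on whitespace, so it would break on a directory name containing a space. A possible alternative, sketched here under the assumption that no jar path contains a newline, joins the `find` results with colons in one pipeline. `jar_classpath` is a hypothetical helper name.

```shell
# Sketch: print every *.jar under a directory as one colon-separated string.
# Safe for paths with spaces (not for paths containing newlines).
# jar_classpath is a hypothetical helper name.
jar_classpath() {
  find "$1" -name '*.jar' | LC_ALL=C sort | paste -s -d ':' -
}
```

It could then be used as `export CLASSPATH="$CLASSPATH:$(jar_classpath /mnt/Vancouver/apps/CoreNLP/lib)"`. Passing an absolute directory keeps the resulting entries absolute, which is what the `realpath` call in the loop above achieves.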



To better follow the annotations, I updated my input test sentences and my RegexNER rules.

$ cat input_sentences.txt 
Victoria lives in Vancouver, Canada. She was born in Nova Scotia. Victoria likes apples and bananas.

$ cat custom_entities.tsv
apple(s)    FRUIT       2
banana(s)   FRUIT       2

Correct output!

$ java -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [0.9 sec].
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.4 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580705 unique entries out of 581864 from edu/stanford/nlp/models/kbp/english/gazetteers/, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585574 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.additional.regexner: Read 5 unique entries out of 5 from custom_entities.tsv, 0 TokensRegex patterns.

Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.4 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.0 sec.
NERCombinerAnnotator: 0.2 sec.
TOTAL: 0.4 sec. for 20 tokens at 54.8 tokens/sec.
Pipeline setup: 8.5 sec.
Total time for StanfordCoreNLP pipeline: 9.1 sec.

Document: ID=input_sentences.txt (3 sentences, 20 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.

[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=CITY]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=COUNTRY]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Extracted the following NER entity mentions:
Victoria    PERSON  LOCATION:0.6059370876590606
Vancouver   CITY    LOCATION:0.9921788688695864
Canada  COUNTRY LOCATION:0.9992413208111567
Sentence #2 (7 tokens):
She was born in Nova Scotia.

[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=was CharacterOffsetBegin=41 CharacterOffsetEnd=44 PartOfSpeech=VBD Lemma=be NamedEntityTag=O]
[Text=born CharacterOffsetBegin=45 CharacterOffsetEnd=49 PartOfSpeech=VBN Lemma=bear NamedEntityTag=O]
[Text=in CharacterOffsetBegin=50 CharacterOffsetEnd=52 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Nova CharacterOffsetBegin=53 CharacterOffsetEnd=57 PartOfSpeech=NNP Lemma=Nova NamedEntityTag=STATE_OR_PROVINCE]
[Text=Scotia CharacterOffsetBegin=58 CharacterOffsetEnd=64 PartOfSpeech=NNP Lemma=Scotia NamedEntityTag=STATE_OR_PROVINCE]
[Text=. CharacterOffsetBegin=64 CharacterOffsetEnd=65 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Extracted the following NER entity mentions:
Nova Scotia STATE_OR_PROVINCE   LOCATION:0.9944154320168771
Sentence #3 (6 tokens):
Victoria likes apples and bananas.

[Text=Victoria CharacterOffsetBegin=66 CharacterOffsetEnd=74 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
[Text=likes CharacterOffsetBegin=75 CharacterOffsetEnd=80 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=81 CharacterOffsetEnd=87 PartOfSpeech=NNS Lemma=apple NamedEntityTag=FRUIT]
[Text=and CharacterOffsetBegin=88 CharacterOffsetEnd=91 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=92 CharacterOffsetEnd=99 PartOfSpeech=NNS Lemma=banana NamedEntityTag=FRUIT]
[Text=. CharacterOffsetBegin=99 CharacterOffsetEnd=100 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Extracted the following NER entity mentions:
Victoria    PERSON  PERSON:0.5045879288466439
apples  FRUIT   -
bananas FRUIT   -

J38 commented 4 years ago

Ok how about we try these commands and see if that works.

The first sets the CLASSPATH environment variable, the next is just for showing that worked, then since it appears the relevant files are in /mnt/Vancouver/apps/CoreNLP/target you should cd into that directory and run the java command.

Assuming /mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05 is an unaltered download of the 3.9.2 distribution folder, things should work properly.

Please let me know if there are any issues and I can help you troubleshoot more.

export CLASSPATH=/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05/*:
cd /mnt/Vancouver/apps/CoreNLP/target
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping custom_entities.tsv -file input_sentences.txt -outputFormat text

J38 commented 4 years ago

Oh wait, sorry, I guess it looks like you've got it working!

J38 commented 4 years ago

At any rate, for the time being I would recommend working with the official 3.9.2 release, since master of Stanford CoreNLP is a bit messy...we are going to release 4.0.0 over the next few weeks.

victoriastuart commented 4 years ago

Yes: working now! I'll mark this Issue as closed.

Thank you once again, @J38 , for your patient help -- very much appreciated! :+1:

Edit: added to ~/.bashrc:

## Since the following lines will duplicate / add all of the $CLASSPATH information
## every time I `exec bash` the terminal, I first explicitly clear $CLASSPATH.
## Alternatively, add to `~/.profile` as described here:

export CLASSPATH=""

export CLASSPATH="$CLASSPATH:/mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-kbp-corenlp-models-current.jar";

for file in `find /mnt/Vancouver/apps/CoreNLP/lib/ -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done
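
An alternative to clearing the variable is to append each entry only when it is not already present, so re-sourcing `~/.bashrc` leaves `$CLASSPATH` unchanged. This is a minimal sketch; `classpath_add` is a hypothetical helper name and the jar paths in the usage note are the ones from this thread.

```shell
# Sketch: append an entry to CLASSPATH only if it is not already listed.
# classpath_add is a hypothetical helper name.
classpath_add() {
  case ":${CLASSPATH-}:" in
    *":$1:"*) ;;                                   # already present: no-op
    *) CLASSPATH="${CLASSPATH:+$CLASSPATH:}$1" ;;  # append, no leading colon
  esac
  export CLASSPATH
}
```

For example, `classpath_add /mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar` can be called any number of times and the jar appears in `$CLASSPATH` once.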