Closed victoriastuart closed 4 years ago
Just to clarify, does this example not work for you? It's key to not include `regexner` in any way. The `ner` annotator should be running the entire named entity recognition process, and having the extra `regexner` could definitely interfere.
java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping example-rule.txt -file rule-sentences.txt -outputFormat text
When I run that example I see my rules and statistical model blended together.
@J38: hello, and thank you for your reply. Yes, that is correct: when I run that exact pipeline (your suggestion above, and Example 2 below),
`-annotators tokenize, ssplit, pos, lemma, ner -ner.additional.regexner.mapping`
I do not get the blended output.
The default CoreNLP tagging, which tags Victoria (me), Vancouver (city) and Canada (country) as LOCATION, and tags apples and bananas as O (OTHER), is shown for reference in Example 1.
I only get RegexNER tagging when I include regexner as an annotator (see, e.g., Example 3),
`-annotators tokenize, ssplit, pos, lemma, ner, regexner -regexner.mapping`
or
`-annotators tokenize, ssplit, pos, lemma, regexner -regexner.mapping`
and in those instances there is no blended tagging.
[Suggestion: if you are working from your own machine, where you develop / code CoreNLP packages, please go to a fresh machine and git clone the repo, to make sure that you are running the same code as those of us who get the code that way.]
Environment:
$ uname -a
Linux victoria 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux
$ which java
/usr/bin/java
$ java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment (build 13.0.1+9)
OpenJDK 64-Bit Server VM (build 13.0.1+9, mixed mode)
$ echo $JAVA_HOME
/usr/lib/jvm/java-13-openjdk/bin/java
$ echo $CORENLP_HOME/
/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05/
$ cat input_sentences.txt
Victoria lives in Vancouver, Canada. She likes apples and bananas.
$ cat custom_entities.tsv
Victoria PERSON LOCATION,ORGANIZATION,CITY 2
Vancouver CITY LOCATION,ORGANIZATION 2
Canada COUNTRY LOCATION,ORGANIZATION,CITY 2
apple FRUIT 2
banana FRUIT 2
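A side note on the mapping file: RegexNER expects its columns to be separated by tabs, and every rule needs at least a pattern and an entity type. The following is a minimal sketch for catching lines that were typed with spaces instead of tabs. It assumes a POSIX shell with `awk`; `check_mapping` is a hypothetical helper name, not part of CoreNLP.

```shell
# Hypothetical helper (not part of CoreNLP): report lines of a RegexNER
# mapping file that have fewer than two tab-separated fields, which usually
# means spaces were used instead of tabs.
check_mapping() {
  awk -F'\t' '
    NF == 0 { next }                                  # skip blank lines
    NF < 2  { printf "line %d: %d field(s)\n", NR, NF; bad = 1 }
    END     { exit bad + 0 }                          # non-zero if any bad line
  ' "$1"
}

# Example use (file name taken from this thread):
#   check_mapping custom_entities.tsv && echo "mapping looks tab-separated"
```

The check only looks at the field count; it does not validate the entity types or priorities.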
Example 1:
$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.5 sec].
Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.1 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.0 sec.
NERCombinerAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 13 tokens at 91.5 tokens/sec.
Pipeline setup: 2.7 sec.
Total time for StanfordCoreNLP pipeline: 2.9 sec.
Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Example 2:
$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.5 sec].
Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.1 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.0 sec.
NERCombinerAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 13 tokens at 100.0 tokens/sec.
Pipeline setup: 2.7 sec.
Total time for StanfordCoreNLP pipeline: 2.8 sec.
Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Example 3:
$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,regexner \
-regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator regexner
TokensRegexNERAnnotator regexner: Read 5 unique entries out of 5 from custom_entities.tsv, 0 TokensRegex patterns.
Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.2 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
TokensRegexNERAnnotator: 0.0 sec.
TOTAL: 0.2 sec. for 13 tokens at 83.3 tokens/sec.
Pipeline setup: 0.8 sec.
Total time for StanfordCoreNLP pipeline: 1.0 sec.
Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=CITY]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=,]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=COUNTRY]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=.]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=.]
What are the contents of the directory where you are running this command?
Looking over your output, it seems like you're running an older Stanford CoreNLP, because it doesn't appear to be running the fine-grained stuff by default when the `ner` annotator is specified.
For instance, when I run using Stanford CoreNLP 3.9.2 I see this output:
$ ~/stanford-corenlp/working_dirs/ner$ echo $CLASSPATH
~/stanford-corenlp/3.9.2/*:
$ ~/stanford-corenlp/working_dirs/ner$ java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping victoria-rules.txt -file victoria-example.txt -outputFormat text
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.8 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.8 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.additional.regexner: Read 1 unique entries out of 1 from victoria-rules.txt, 0 TokensRegex patterns.
Processing file ~/stanford-corenlp/working_dirs/ner/victoria-example.txt ... writing to ~/stanford-corenlp/working_dirs/ner/victoria-example.txt.out
Annotating file ~/stanford-corenlp/working_dirs/ner/victoria-example.txt ... done [4.7 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 4.5 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 0.1 sec.
TOTAL: 4.7 sec. for 7 tokens at 1.5 tokens/sec.
Pipeline setup: 17.3 sec.
Total time for StanfordCoreNLP pipeline: 22.2 sec.
Also, in your Example 2 all three entities are tagged LOCATION, which indicates the fine-grained NER did not run at all. When I look at your output, I'm not seeing `ner.fine.regexner` or `ner.additional.regexner` running.
So if you're running this in a directory with older Stanford CoreNLP code, the `-cp "*"` will cause it to use whatever code is in the directory where you're running the command. The `CORENLP_HOME` variable is used by the Python code; the Java code ignores it.
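To see which core jars a `-cp "*"` run would pull in from a given directory, something like the sketch below may help. `list_core_jars` is a hypothetical helper, not a CoreNLP tool, and it assumes release-style jar names such as `stanford-corenlp-3.9.2.jar`. Java does not specify the order in which a classpath wildcard enumerates jars, so if more than one version is listed it is unclear which one ends up being used.

```shell
# Hypothetical helper (not part of CoreNLP): print the versioned core
# CoreNLP jars found directly in a directory, skipping the models, sources,
# and javadoc jars. Defaults to the current directory.
list_core_jars() {
  dir=${1:-.}
  for jar in "$dir"/stanford-corenlp-[0-9]*.jar; do
    [ -e "$jar" ] || continue            # glob matched nothing
    case "$jar" in
      *-models.jar|*-sources.jar|*-javadoc.jar) continue ;;
    esac
    basename "$jar"
  done
}

# Example use: more than one line of output means mixed versions.
#   list_core_jars /path/to/run/directory
```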
Hi, sorry: I should have mentioned my classpath. I have two CoreNLP installations: one in stanford-corenlp-full-2018-10-05, and a `git clone https://github.com/stanfordnlp/CoreNLP` that I have been using to troubleshoot this issue. Here are the details.
[victoria@victoria stanford-corenlp-full-2018-10-05]$ pwd; ls -l
/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05
total 386064
-rw-rw-r-- 1 victoria victoria 6103 Oct 8 2018 build.xml
-rwxrwxr-x 1 victoria victoria 871 Oct 8 2018 corenlp.sh
-rwxrwxr-x 1 victoria victoria 5477 Oct 8 2018 CoreNLP-to-HTML.xsl
-rw-r--r-- 1 victoria victoria 101 Jan 9 16:39 custom_entities.tsv
-rw-rw-r-- 1 victoria victoria 211938 Oct 8 2018 ejml-0.23.jar
-rw-rw-r-- 1 victoria victoria 1227451 Oct 8 2018 ejml-0.23-src.zip
-rw-r--r-- 1 victoria victoria 36 Jan 9 16:33 input_sentence.txt
-rw-rw-r-- 1 victoria victoria 89 Oct 8 2018 input.txt
-rw-rw-r-- 1 victoria victoria 19868 Oct 8 2018 input.txt.xml
-rw-rw-r-- 1 victoria victoria 56674 Oct 8 2018 javax.activation-api-1.2.0.jar
-rw-rw-r-- 1 victoria victoria 78896 Oct 8 2018 javax.activation-api-1.2.0-sources.jar
-rw-rw-r-- 1 victoria victoria 54860 Oct 8 2018 javax.json-api-1.0-sources.jar
-rw-rw-r-- 1 victoria victoria 85147 Oct 8 2018 javax.json.jar
-rw-rw-r-- 1 victoria victoria 128032 Oct 8 2018 jaxb-api-2.4.0-b180830.0359.jar
-rw-rw-r-- 1 victoria victoria 270926 Oct 8 2018 jaxb-api-2.4.0-b180830.0359-sources.jar
-rw-rw-r-- 1 victoria victoria 254858 Oct 8 2018 jaxb-core-2.3.0.1.jar
-rw-rw-r-- 1 victoria victoria 345974 Oct 8 2018 jaxb-core-2.3.0.1-sources.jar
-rw-rw-r-- 1 victoria victoria 1099271 Oct 8 2018 jaxb-impl-2.4.0-b180830.0438.jar
-rw-rw-r-- 1 victoria victoria 1132702 Oct 8 2018 jaxb-impl-2.4.0-b180830.0438-sources.jar
-rw-rw-r-- 1 victoria victoria 774317 Oct 8 2018 joda-time-2.9-sources.jar
-rw-rw-r-- 1 victoria victoria 629506 Oct 8 2018 joda-time.jar
-rw-rw-r-- 1 victoria victoria 196945 Oct 8 2018 jollyday-0.4.9-sources.jar
-rw-rw-r-- 1 victoria victoria 213591 Oct 8 2018 jollyday.jar
-rw-rw-r-- 1 victoria victoria 1667 Oct 8 2018 LIBRARY-LICENSES
-rw-rw-r-- 1 victoria victoria 35147 Oct 8 2018 LICENSE.txt
-rw-rw-r-- 1 victoria victoria 769 Oct 8 2018 Makefile
drwxrwxr-x 2 victoria victoria 4096 Oct 8 2018 patterns
-rw-rw-r-- 1 victoria victoria 6279 Oct 8 2018 pom-java-11.xml
-rw-rw-r-- 1 victoria victoria 6135 Oct 8 2018 pom.xml
-rw-rw-r-- 1 victoria victoria 1347123 Oct 8 2018 protobuf.jar
-rw-rw-r-- 1 victoria victoria 4262 Oct 8 2018 README.txt
-rw-r--r-- 1 victoria victoria 2698 Jan 8 20:35 regexner.props
-rw-rw-r-- 1 victoria victoria 367 Oct 8 2018 RESOURCE-LICENSES
-rw-rw-r-- 1 victoria victoria 2445 Oct 8 2018 SemgrexDemo.java
-rw-r--r-- 1 victoria victoria 1720 Jan 7 18:06 serialized.props
-rw-rw-r-- 1 victoria victoria 1828 Oct 8 2018 ShiftReduceDemo.java
-rw-rw-r-- 1 victoria victoria 32127 Oct 8 2018 slf4j-api.jar
-rw-rw-r-- 1 victoria victoria 10712 Oct 8 2018 slf4j-simple.jar
-rw-rw-r-- 1 victoria victoria 8146873 Oct 8 2018 stanford-corenlp-3.9.2.jar
-rw-rw-r-- 1 victoria victoria 9687426 Oct 8 2018 stanford-corenlp-3.9.2-javadoc.jar
-rw-rw-r-- 1 victoria victoria 362565193 Oct 8 2018 stanford-corenlp-3.9.2-models.jar
-rw-rw-r-- 1 victoria victoria 5370905 Oct 8 2018 stanford-corenlp-3.9.2-sources.jar
-rw-rw-r-- 1 victoria victoria 7240 Oct 8 2018 StanfordCoreNlpDemo.java
-rw-rw-r-- 1 victoria victoria 199885 Oct 8 2018 StanfordDependenciesManual.pdf
drwxrwxr-x 2 victoria victoria 4096 Oct 8 2018 sutime
-rw-r--r-- 1 victoria victoria 702 Jan 8 14:51 text.props
drwxrwxr-x 2 victoria victoria 4096 Oct 8 2018 tokensregex
-rw-rw-r-- 1 victoria victoria 672122 Oct 8 2018 xom-1.2.10-src.jar
-rw-rw-r-- 1 victoria victoria 313253 Oct 8 2018 xom.jar
[victoria@victoria stanford-corenlp-full-2018-10-05]$
[victoria@victoria CoreNLP]$ date; pwd
Tue 10 Dec 2019 11:46:09 AM PST
/mnt/Vancouver/apps/CoreNLP
[victoria@victoria CoreNLP]$ git pull
Already up to date.
[victoria@victoria CoreNLP]$ mvn package
...
[ ... SNIP! ... ]
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Results :
Tests run: 1241, Failures: 0, Errors: 0, Skipped: 1
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ stanford-corenlp ---
[INFO] Building jar: /mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar
[INFO] --- build-helper-maven-plugin:1.7:attach-artifact (attach-models) @ stanford-corenlp ---
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven-plugin-api/2.0/maven-plugin-api-2.0.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven-plugin-api/2.0/maven-plugin-api-2.0.pom (601 B at 22 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven/2.0/maven-2.0.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/maven/2.0/maven-2.0.pom (8.8 kB at 283 kB/s)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 43.980 s
[INFO] Finished at: 2019-12-10T11:54:57-08:00
[INFO] ------------------------------------------------------------------------
I have been running CoreNLP here, where I have CoreNLP git cloned:
[victoria@victoria ~]$ pwd
[victoria@victoria target]$ pwd
/mnt/Vancouver/apps/CoreNLP/target
[victoria@victoria target]$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,regexner -file input_sentences.txt -outputFormat text; echo; cat input_sentences.txt.out
# ----------------------------------------------------------------------------
~/.bashrc:
# export CORENLP_HOME=/mnt/Vancouver/apps/CoreNLP/target
export CORENLP_HOME=/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05
<<COMMENT
2020-01-10:
[victoria@victoria RegexNER]$ java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP
cd /mnt/Vancouver/apps/CoreNLP/target
[victoria@victoria target]$ java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP
Searching for resource: StanfordCoreNLP.properties ... not found.
Searching for resource: edu/stanford/nlp/pipeline/StanfordCoreNLP.properties ... found.
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
Adding annotator lemma
Adding annotator ner
...
THIS WORKS FROM ANY DIR:
java -Xmx16g -cp '/mnt/Vancouver/apps/CoreNLP/target/*' edu.stanford.nlp.pipeline.StanfordCoreNLP
FOR THIS, MUST cd TO /mnt/Vancouver/apps/CoreNLP/target/
cd /mnt/Vancouver/apps/CoreNLP/target/
java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP
COMMENT
# ----------------------------------------------------------------------------
[victoria@victoria target]$ echo $CORENLP_HOME/
/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05/
[victoria@victoria target]$ export CORENLP_HOME=/mnt/Vancouver/apps/CoreNLP/target
[victoria@victoria target]$ exec bash
[victoria@victoria target]$ echo $CORENLP_HOME/
/mnt/Vancouver/apps/CoreNLP/target/
# ----------------------------------------------------------------------------
[victoria@victoria apps]$ cd CoreNLP
[victoria@victoria CoreNLP]$ pwd; ls -l
/mnt/Vancouver/apps/CoreNLP
total 17944
-rw-r--r-- 1 victoria victoria 1311 Dec 6 13:55 build.gradle
-rw-r--r-- 1 victoria victoria 27113 Dec 6 13:55 build.xml
drwxr-xr-x 2 victoria victoria 4096 Jul 7 2017 classes
-rw-r--r-- 1 victoria victoria 4901 Jul 7 2017 commonbuildjsp.xml
-rw-r--r-- 1 victoria victoria 1824 Jul 7 2017 CONTRIBUTING.md
-rw-r--r-- 1 victoria victoria 3197 Dec 10 14:29 corenlp_test.py
drwxr-xr-x 4 victoria victoria 4096 Jul 7 2017 data
drwxr-xr-x 14 victoria victoria 4096 Dec 6 13:55 doc
drwxr-xr-x 3 victoria victoria 4096 Dec 6 13:55 examples
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 gradle
-rwxr-xr-x 1 victoria victoria 5241 Jul 7 2017 gradlew
-rw-r--r-- 1 victoria victoria 2260 Jul 7 2017 gradlew.bat
drwxr-xr-x 2 victoria victoria 4096 Dec 12 14:45 input
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 itest
-rw-r--r-- 1 victoria victoria 8166 Jul 7 2017 JavaNLP-core.eml
-rw-r--r-- 1 victoria victoria 129 Jul 7 2017 JavaNLP-core.iml
drwxr-xr-x 3 victoria victoria 4096 Jan 9 16:19 lib
drwxr-xr-x 2 victoria victoria 4096 Jul 7 2017 liblocal
drwxr-xr-x 3 victoria victoria 4096 Jan 9 16:19 libsrc
drwxr-xr-x 4 victoria victoria 4096 Jul 7 2017 licenses
-rw-r--r-- 1 victoria victoria 35147 Jul 7 2017 LICENSE.txt
-rw-r--r-- 1 victoria victoria 3391 Jul 7 2017 module_core.xml
drwxr-xr-x 2 victoria victoria 4096 Dec 10 20:53 output
-rw-r--r-- 1 victoria victoria 6374 Jan 9 16:19 pom-java-11.xml
-rw-r--r-- 1 victoria victoria 6221 Jan 9 16:19 pom.xml
-rw-r--r-- 1 victoria victoria 7935 Dec 6 13:55 README.md
-rw-r--r-- 1 victoria victoria 74539 Jan 10 12:19 _readme-victoria-CoreNLP-StanfordNLP-notes.txt
-rw-r--r-- 1 victoria victoria 196676 Dec 30 20:05 _readme-victoria-corenlp.txt
-rw-r--r-- 1 victoria victoria 10638 Dec 24 17:21 _readme-victoria-stanford_openie.txt
-rw-r--r-- 1 victoria victoria 367 Dec 6 13:55 RESOURCE-LICENSES
drwxr-xr-x 11 victoria victoria 4096 Dec 6 13:55 scripts
-rw-r--r-- 1 victoria victoria 12326806 Dec 17 19:55 spacy
drwxr-xr-x 3 victoria victoria 4096 Aug 18 2017 src
drwxr-xr-x 6 victoria victoria 4096 Dec 31 15:32 src-local
drwxr-xr-x 3 victoria victoria 4096 Dec 12 16:50 stanford-corenlp-full
-rw-r--r-- 1 victoria victoria 5528127 Dec 10 14:37 stanfordnlp
drwxr-xr-x 11 victoria victoria 4096 Jan 12 18:46 target
drwxr-xr-x 4 victoria victoria 4096 Jul 7 2017 test
drwxr-xr-x 3 victoria victoria 4096 Jan 3 15:16 _victoria
drwxr-xr-x 7 victoria victoria 4096 Nov 7 2017 web
[victoria@victoria CoreNLP]$ cd target
[victoria@victoria target]$ ls -l
total 1849924
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 classes
-rw-r--r-- 1 victoria victoria 159 Jan 12 19:24 custom_entities2.tsv
-rw-r--r-- 1 victoria victoria 159 Jan 12 19:25 custom_entities.tsv
-rw-r--r-- 1 victoria victoria 419 Jan 9 17:23 custom_entities.tsv.bak
-rw-r--r-- 1 victoria victoria 2541 Aug 9 2017 DependencyTreeExample.class
-rw-r--r-- 1 victoria victoria 1430 Aug 9 2017 DependencyTreeExample.java
-rw-r--r-- 1 victoria victoria 85 Jan 12 18:36 fruit.rules
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 generated-sources
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 generated-test-sources
drwxr-xr-x 3 victoria victoria 12288 Aug 24 2017 icons
-rw-r--r-- 1 victoria victoria 78 Jan 10 13:31 input_sentence_2.txt
-rw-r--r-- 1 victoria victoria 67 Jan 12 19:17 input_sentences.txt
-rw-r--r-- 1 victoria victoria 1528 Jan 12 21:54 input_sentences.txt.out
-rw-r--r-- 1 victoria victoria 54 Jan 12 18:42 input_sentence.txt
-rw-r--r-- 1 victoria victoria 2857 Jan 10 17:04 input_sentence.txt.json
-rw-r--r-- 1 victoria victoria 1117 Jan 12 16:24 input_sentence.txt.out
-rw-r--r-- 1 victoria victoria 0 Jan 10 16:59 input_sentence.txt.xml
drwxr-xr-x 2 victoria victoria 4096 Jul 7 2017 maven-archiver
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 maven-status
-rw-r--r-- 1 victoria victoria 3650 Jan 12 18:33 rule-sentences.txt.out
-rw-r--r-- 1 victoria victoria 26 Jan 12 18:37 sentence.txt
-rw-r--r-- 1 victoria victoria 702 Jan 12 18:44 sentence.txt.out
-rw-r--r-- 1 victoria victoria 85 Jan 12 18:36 sports_teams.rules
-rw-r--r-- 1 victoria victoria 9106502 Aug 18 2017 stanford-corenlp-3.7.0.jar
-rw-r--r-- 1 victoria victoria 9446305 Jan 11 12:04 stanford-corenlp-3.9.2.jar
-rw-r--r-- 1 victoria victoria 362594065 Jul 7 2017 stanford-corenlp-models-current.jar
-rw-r--r-- 1 victoria victoria 1039009129 Jul 7 2017 stanford-english-corenlp-models-current.jar
-rw-r--r-- 1 victoria victoria 474001837 Jul 7 2017 stanford-english-kbp-corenlp-models-current.jar
drwxr-xr-x 2 victoria victoria 36864 Dec 10 11:54 surefire-reports
drwxr-xr-x 3 victoria victoria 4096 Jul 7 2017 test-classes
[victoria@victoria target]$ echo $CORENLP_HOME/
/mnt/Vancouver/apps/CoreNLP/target/
[victoria@victoria target]$ java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLP
Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]
[victoria@victoria target]$ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.4 sec].
Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.2 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 0.0 sec.
TOTAL: 0.2 sec. for 13 tokens at 67.7 tokens/sec.
Pipeline setup: 2.7 sec.
Total time for StanfordCoreNLP pipeline: 2.9 sec.
Document: ID=input_sentences.txt (2 sentences, 13 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=LOCATION]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=LOCATION]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=LOCATION]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Sentence #2 (6 tokens):
She likes apples and bananas.
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=likes CharacterOffsetBegin=41 CharacterOffsetEnd=46 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=47 CharacterOffsetEnd=53 PartOfSpeech=NNS Lemma=apple NamedEntityTag=O]
[Text=and CharacterOffsetBegin=54 CharacterOffsetEnd=57 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=58 CharacterOffsetEnd=65 PartOfSpeech=NNS Lemma=banana NamedEntityTag=O]
[Text=. CharacterOffsetBegin=65 CharacterOffsetEnd=66 PartOfSpeech=. Lemma=. NamedEntityTag=O]
[victoria@victoria target]$
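One way to confirm whether the fine-grained and custom rules were loaded is to look for the `ner.fine.regexner: Read ...` and `ner.additional.regexner: Read ...` lines that appear in the 3.9.2 log earlier in this thread, and that are absent from the run above. A rough sketch, assuming the run's log has been saved to a file; `regexner_loaded` and `run.log` are made-up names:

```shell
# Hypothetical helper: succeed (and print the matching lines) if a saved
# CoreNLP log shows the ner annotator loading its regexner rule files.
regexner_loaded() {
  grep -E 'ner\.(fine|additional)\.regexner: Read' "$1"
}

# Example use, capturing both stdout and stderr of a run:
#   java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
#     -annotators tokenize,ssplit,pos,lemma,ner \
#     -ner.additional.regexner.mapping custom_entities.tsv \
#     -file input_sentences.txt -outputFormat text > run.log 2>&1
#   regexner_loaded run.log || echo "regexner rules were not loaded"
```

The pattern deliberately does not match the standalone `regexner` annotator's log line from Example 3, since that run bypasses the `ner` annotator entirely.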
SOLUTION
OK, per @J38's kind comments, this is solved! :-D
$ echo $CLASSPATH
$ ## blank
Per:
I appended the following to my $CLASSPATH.
$ export CLASSPATH="$CLASSPATH:/mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-kbp-corenlp-models-current.jar";
$ for file in `find /mnt/Vancouver/apps/CoreNLP/lib/ -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done
$ echo $CLASSPATH
:/mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-kbp-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/lib/jaxb-api-2.4.0-b180830.0359.jar:/mnt/Vancouver/apps/CoreNLP/lib/jollyday-0.4.9.jar:/mnt/Vancouver/apps/CoreNLP/lib/commons-logging.jar:/mnt/Vancouver/apps/CoreNLP/lib/tomcat/jasper-el.jar:/mnt/Vancouver/apps/CoreNLP/lib/tomcat/jsp-api.jar:/mnt/Vancouver/apps/CoreNLP/lib/tomcat/tomcat-api.jar:/mnt/Vancouver/apps/CoreNLP/lib/tomcat/jasper.jar:/mnt/Vancouver/apps/CoreNLP/lib/tomcat/el-api.jar:/mnt/Vancouver/apps/CoreNLP/lib/tomcat/tomcat-juli.jar:/mnt/Vancouver/apps/CoreNLP/lib/ejml-ddense-0.38.jar:/mnt/Vancouver/apps/CoreNLP/lib/ejml-simple-0.38.jar:/mnt/Vancouver/apps/CoreNLP/lib/ant-contrib-1.0b3.jar:/mnt/Vancouver/apps/CoreNLP/lib/jaxb-impl-2.4.0-b180830.0438.jar:/mnt/Vancouver/apps/CoreNLP/lib/jaxb-core-2.3.0.1.jar:/mnt/Vancouver/apps/CoreNLP/lib/jflex-1.6.1.jar:/mnt/Vancouver/apps/CoreNLP/lib/lucene-core-7.5.0.jar:/mnt/Vancouver/apps/CoreNLP/lib/lucene-analyzers-common-7.5.0.jar:/mnt/Vancouver/apps/CoreNLP/lib/junit.jar:/mnt/Vancouver/apps/CoreNLP/lib/joda-time.jar:/mnt/Vancouver/apps/CoreNLP/lib/protobuf.jar:/mnt/Vancouver/apps/CoreNLP/lib/javax.servlet.jar:/mnt/Vancouver/apps/CoreNLP/lib/javacc.jar:/mnt/Vancouver/apps/CoreNLP/lib/ejml-core-0.38.jar:/mnt/Vancouver/apps/CoreNLP/lib/lucene-queryparser-7.5.0.jar:/mnt/Vancouver/apps/CoreNLP/lib/xom-1.3.2.jar:/mnt/Vancouver/apps/CoreNLP/lib/AppleJavaExtensions.jar:/mnt/Vancouver/apps/CoreNLP/lib/javax.activation-api-1.2.0.jar:/mnt/Vancouver/apps/CoreNLP/lib/lucene-demo-7.5.0.jar:/mnt/Vancouver/apps/CoreNLP/lib/javax.json.jar:/mnt/Vancouver/apps/CoreNLP/lib/log4j-1.2.16.jar:/mnt/Vancouver/apps/CoreNLP/lib/commons-lang3-3.1.jar:/mnt/Vancouver/apps/CoreNLP/lib/slf4j-simple.jar:/mnt/Vancouver/apps/CoreNLP/lib/appbundler-1.0.jar:/mnt/Vancouver/apps/CoreNLP/lib/slf4j-api.jar
To better follow the annotations, I updated my input test sentences and my RegexNER rules.
$ cat input_sentences.txt
Victoria lives in Vancouver, Canada. She was born in Nova Scotia. Victoria likes apples and bananas.
$ cat custom_entities.tsv
Victoria PERSON LOCATION,ORGANIZATION,CITY 2
Vancouver CITY LOCATION,ORGANIZATION 2
Canada COUNTRY LOCATION,ORGANIZATION,CITY 2
apple(s) FRUIT 2
banana(s) FRUIT 2
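A RegexNER mapping file is tab-separated, and tabs are easily turned into spaces by copy/paste or an editor. As a rough sanity check before running the pipeline, a sketch like the following flags lines that do not contain at least two tab-separated columns (pattern and NER tag). `check_mapping` is a hypothetical helper name, not part of CoreNLP, and the "at least two columns" rule is a simplifying assumption.

```shell
# check_mapping (hypothetical helper): print every line of a mapping file
# that has fewer than two tab-separated columns, and return non-zero if
# any such line was found.
check_mapping() {
    awk -F '\t' '
        NF < 2 { printf "line %d has %d tab-separated column(s): %s\n", NR, NF, $0; bad = 1 }
        END    { exit bad }
    ' "$1"
}

# Only run the check if the file from this thread is present.
if [ -f custom_entities.tsv ]; then
    check_mapping custom_entities.tsv
fi
```

A line such as `Canada COUNTRY 2` typed with spaces would be reported, since awk then sees a single column.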
Correct output!
$ java -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner \
-ner.additional.regexner.mapping custom_entities.tsv \
-file input_sentences.txt \
-outputFormat text; \
cat input_sentences.txt.out
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [0.9 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.4 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580705 unique entries out of 581864 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585574 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.additional.regexner: Read 5 unique entries out of 5 from custom_entities.tsv, 0 TokensRegex patterns.
Processing file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... writing to /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt.out
Annotating file /mnt/Vancouver/apps/CoreNLP/target/input_sentences.txt ... done [0.4 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.0 sec.
NERCombinerAnnotator: 0.2 sec.
TOTAL: 0.4 sec. for 20 tokens at 54.8 tokens/sec.
Pipeline setup: 8.5 sec.
Total time for StanfordCoreNLP pipeline: 9.1 sec.
Document: ID=input_sentences.txt (3 sentences, 20 tokens)
Sentence #1 (7 tokens):
Victoria lives in Vancouver, Canada.
Tokens:
[Text=Victoria CharacterOffsetBegin=0 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
[Text=lives CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O]
[Text=in CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Vancouver CharacterOffsetBegin=18 CharacterOffsetEnd=27 PartOfSpeech=NNP Lemma=Vancouver NamedEntityTag=CITY]
[Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=Canada CharacterOffsetBegin=29 CharacterOffsetEnd=35 PartOfSpeech=NNP Lemma=Canada NamedEntityTag=COUNTRY]
[Text=. CharacterOffsetBegin=35 CharacterOffsetEnd=36 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Extracted the following NER entity mentions:
Victoria PERSON LOCATION:0.6059370876590606
Vancouver CITY LOCATION:0.9921788688695864
Canada COUNTRY LOCATION:0.9992413208111567
Sentence #2 (7 tokens):
She was born in Nova Scotia.
Tokens:
[Text=She CharacterOffsetBegin=37 CharacterOffsetEnd=40 PartOfSpeech=PRP Lemma=she NamedEntityTag=O]
[Text=was CharacterOffsetBegin=41 CharacterOffsetEnd=44 PartOfSpeech=VBD Lemma=be NamedEntityTag=O]
[Text=born CharacterOffsetBegin=45 CharacterOffsetEnd=49 PartOfSpeech=VBN Lemma=bear NamedEntityTag=O]
[Text=in CharacterOffsetBegin=50 CharacterOffsetEnd=52 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=Nova CharacterOffsetBegin=53 CharacterOffsetEnd=57 PartOfSpeech=NNP Lemma=Nova NamedEntityTag=STATE_OR_PROVINCE]
[Text=Scotia CharacterOffsetBegin=58 CharacterOffsetEnd=64 PartOfSpeech=NNP Lemma=Scotia NamedEntityTag=STATE_OR_PROVINCE]
[Text=. CharacterOffsetBegin=64 CharacterOffsetEnd=65 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Extracted the following NER entity mentions:
Nova Scotia STATE_OR_PROVINCE LOCATION:0.9944154320168771
She PERSON -
Sentence #3 (6 tokens):
Victoria likes apples and bananas.
Tokens:
[Text=Victoria CharacterOffsetBegin=66 CharacterOffsetEnd=74 PartOfSpeech=NNP Lemma=Victoria NamedEntityTag=PERSON]
[Text=likes CharacterOffsetBegin=75 CharacterOffsetEnd=80 PartOfSpeech=VBZ Lemma=like NamedEntityTag=O]
[Text=apples CharacterOffsetBegin=81 CharacterOffsetEnd=87 PartOfSpeech=NNS Lemma=apple NamedEntityTag=FRUIT]
[Text=and CharacterOffsetBegin=88 CharacterOffsetEnd=91 PartOfSpeech=CC Lemma=and NamedEntityTag=O]
[Text=bananas CharacterOffsetBegin=92 CharacterOffsetEnd=99 PartOfSpeech=NNS Lemma=banana NamedEntityTag=FRUIT]
[Text=. CharacterOffsetBegin=99 CharacterOffsetEnd=100 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Extracted the following NER entity mentions:
Victoria PERSON PERSON:0.5045879288466439
apples FRUIT -
bananas FRUIT -
$
Ok, how about we try these commands and see if that works.
The first sets the `CLASSPATH` environment variable, and the next just shows that it worked. Then, since it appears the relevant files are in `/mnt/Vancouver/apps/CoreNLP/target`, you should `cd` into that directory and run the `java` command.
Assuming `/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05` is an unaltered download of the 3.9.2 distribution folder, things should work properly.
Please let me know if there are any issues and I can help you troubleshoot more.
export CLASSPATH=/mnt/Vancouver/apps/CoreNLP/stanford-corenlp-full/stanford-corenlp-full-2018-10-05/*:
echo $CLASSPATH
cd /mnt/Vancouver/apps/CoreNLP/target
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping custom_entities.tsv -file input_sentences.txt -outputFormat text
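Since the problem in this thread came down to `CLASSPATH`, it may help to confirm that every entry actually exists before running `java`. The following is a minimal sketch; `check_classpath` is a hypothetical helper name. It treats an entry ending in `/*` as a Java classpath wildcard and checks only the directory part.

```shell
# check_classpath (hypothetical helper): list CLASSPATH entries that do not
# exist on disk. The body runs in a subshell, so the IFS and globbing
# changes do not leak into the calling shell.
check_classpath() (
    IFS=:
    set -f                          # keep the shell from expanding dir/*
    missing=0
    for entry in ${CLASSPATH:-}; do
        [ -n "$entry" ] || continue # skip empty entries (leading/trailing colon)
        case "$entry" in
            */\*) path=${entry%/\*} ;;  # wildcard entry: check the directory
            *)    path=$entry ;;
        esac
        if [ ! -e "$path" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    exit "$missing"
)
```

Running `check_classpath` after the `export` line prints nothing and returns 0 when all entries resolve.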
Oh wait, sorry, I guess it looks like you've got it working!
At any rate, for the time being I would recommend working with the official 3.9.2 release, since `master` of Stanford CoreNLP is a bit messy... we are going to release 4.0.0 over the next few weeks.
Yes: working now! I'll mark this Issue as closed.
Thank you once again, @J38 , for your patient help -- very much appreciated! :+1:
Edit: added to `~/.bashrc`:
## https://stanfordnlp.github.io/CoreNLP/download.html#steps-to-setup-from-the-github-head-version
## Since the following lines will duplicate / add all of the $CLASSPATH information
## every time I `exec bash` the terminal, I first explicitly clear that PATH.
## Alternatively, add to `~/.profile` as described here:
## https://stackoverflow.com/questions/13830594/when-i-execute-bash-the-path-keeps-repeating-itself
export CLASSPATH=""
export CLASSPATH="$CLASSPATH:/mnt/Vancouver/apps/CoreNLP/target/stanford-corenlp-3.9.2.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-corenlp-models-current.jar:/mnt/Vancouver/apps/CoreNLP/models/stanford-english-kbp-corenlp-models-current.jar";
for file in `find /mnt/Vancouver/apps/CoreNLP/lib/ -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done
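Instead of clearing `CLASSPATH` each time, another option is to append an entry only when it is not already listed, so that re-sourcing `~/.bashrc` does not grow the variable. This is a sketch under assumptions: `add_to_classpath` and `CORENLP_DIR` are hypothetical names, and the globs cover only `target/`, `models/`, `lib/` and `lib/tomcat/` rather than the recursive `find` used above.

```shell
# add_to_classpath (hypothetical helper): append $1 to CLASSPATH unless it
# is already one of the colon-separated entries.
add_to_classpath() {
    case ":${CLASSPATH:-}:" in
        *":$1:"*) ;;                                        # already listed
        *) CLASSPATH="${CLASSPATH:+${CLASSPATH}:}$1" ;;     # append, no leading colon
    esac
}

# Assumed checkout location, taken from the paths used in this thread.
CORENLP_DIR="${CORENLP_DIR:-/mnt/Vancouver/apps/CoreNLP}"

for jar in "$CORENLP_DIR"/target/stanford-corenlp-3.9.2.jar \
           "$CORENLP_DIR"/models/*.jar \
           "$CORENLP_DIR"/lib/*.jar \
           "$CORENLP_DIR"/lib/tomcat/*.jar; do
    # Unmatched globs stay literal, so skip anything that does not exist.
    if [ -e "$jar" ]; then
        add_to_classpath "$jar"
    fi
done
export CLASSPATH
```

Because the helper is idempotent, running `exec bash` repeatedly leaves `CLASSPATH` unchanged after the first pass.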
Update / Solution: it was a `$CLASSPATH` issue. Users reading this can skip most of the content below; here are the key comments.
- https://github.com/stanfordnlp/CoreNLP/issues/983#issuecomment-573462561 [@J38 comment]
- https://github.com/stanfordnlp/CoreNLP/issues/983#issuecomment-573536392 [@victoriastuart actual issue | solution]
- https://github.com/stanfordnlp/CoreNLP/issues/983#issuecomment-573537834 [@victoriastuart comment re: appending `$CLASSPATH` to `~/.bashrc` or `~/.profile`]
I'm going to echo @DaveQuinn29 's concern [#910] that there is a cache and/or some other issue in CoreNLP. Issues involved include:
- `tokenize | tokenize, ssplit | ...`)
- when `-regexner.mapping` is used in conjunction with the `ner, regexner` annotators, the RegexNER mappings (in a TSV) are ignored
- when `-regexner.mapping` is used in conjunction with the `regexner` annotator (`ner` is excluded), the RegexNER mappings (in a TSV) are applied
I had been trying to add custom NER tagging via a custom RegexNER file, as I did with the Java version a couple of years ago (and more recently in Python via stanfordnlp). I can't get RegexNER to work in Python, so I returned to the Java implementation of CoreNLP -- from the command line -- to troubleshoot (and work from there, if needed).
However, once again I have not had much success in NER tagging with CoreNLP's trained models plus my custom RegexNER TSV file, formatted as described at https://nlp.stanford.edu/software/regexner.html
I've tried various permutations of `tokenize, ssplit, pos, lemma, ner, regexner` (always in that relative order) with `-regexner.mapping` | `-ner.additional.regexner.mapping` ... and I cannot simultaneously NER tag text with the default CoreNLP libraries plus my own custom NER tags.
I expect the `ner,regexner` annotators combination with `-regexner.mapping` to co-tag with CoreNLP tags, superseded by custom NER tags. Is that true, or am I mistaken?
[I haven't looked at the `rules` approach, https://stanfordnlp.github.io/CoreNLP/ner.html#regexner-rules-format.]
While I can get either NER (the default statistical model + ... fine NER rules added via the regexner annotator) or `-regexner.mapping` (with my custom tokens file) to work, it's always either one or the other. And before I'm directed there (cough: @J38), I've certainly looked at the https://stanfordnlp.github.io/CoreNLP/ner.html page to which we are so often referred.
Furthermore, in evaluating various permutations of annotators, I've found that, when adding annotators stepwise, the `lemma` annotator is particularly troublesome, immediately breaking RegexNER. And, when I try to step back, old annotator settings are retained (cached?). For example, I get lemmatization, a dependency parse, etc. in the output even if those annotators are not included in the annotators argument list.
It appears that whenever CoreNLP encounters an error, it silently loads the defaults, so that user-defined settings are ignored.
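One way to rule out stale results while debugging: the commands in this thread chain `java ...; cat input_sentences.txt.out`, and with `;` the `cat` runs even if `java` failed, so an output file left over from an earlier run can be printed. Whether that explains the "cached" behaviour described here is an assumption, not something the thread confirms. The sketch below uses `run_fresh`, a hypothetical wrapper that deletes the old output first and prints the new one only if the command succeeded.

```shell
# run_fresh (hypothetical wrapper): remove any previous output file, run the
# given command, and print the output file only if the command succeeded
# and actually wrote it.
# Usage: run_fresh OUTPUT_FILE COMMAND [ARGS...]
run_fresh() {
    out_file="$1"
    shift
    rm -f -- "$out_file"
    if "$@" && [ -f "$out_file" ]; then
        cat -- "$out_file"
    else
        echo "command failed or did not write $out_file" >&2
        return 1
    fi
}

# Example with the pipeline from this thread (left commented out here):
# run_fresh input_sentences.txt.out \
#     java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP \
#     -annotators tokenize,ssplit,pos,lemma,ner \
#     -ner.additional.regexner.mapping custom_entities.tsv \
#     -file input_sentences.txt -outputFormat text
```

With this wrapper, a failed run prints an error instead of the previous run's annotations.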
Similar issues / concerns have been raised elsewhere.
If I slowly add annotators one at a time, I can (sort of: not consistently) "reset" CoreNLP:
Adding the `ner` annotator breaks RegexNER tagging (above):
"Cache" (?) issue -- no classpath, etc. given, yet it outputs the previous result (obfuscating debugging attempts, by the way):