stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Correct path (remove gazetteers) for DEFAULT_KBP_TOKENSREGEX_DIR in DefaultPaths class #739

Open datancoffee opened 6 years ago

datancoffee commented 6 years ago

When I build the jar from GitHub HEAD following the instructions at https://stanfordnlp.github.io/CoreNLP/download.html, the resulting code fails to load the NER models because of the extra "gazetteers" segment in the path below:

public static final String DEFAULT_KBP_REGEXNER_CASELESS = "edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab";

Steps:

  1. I built javanlp-core.jar from HEAD:

ant jar

  2. I downloaded all the model files and added them to the CLASSPATH, but that did not help:

export CLASSPATH="$CLASSPATH:/pathto/corenlp/javanlp-core.jar:/pathto/corenlp/stanford-corenlp-3.9.1-models.jar:/pathto/corenlp/stanford-corenlp-3.9.1-models-english.jar:/pathto/corenlp/stanford-corenlp-3.9.1-models-english-kbp.jar";

  3. Only when I downloaded stanford-corenlp-3.9.1.jar and used it in place of javanlp-core.jar was I able to run successfully:

java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt

(Actually, I had to increase the memory from 3g to 5g, since 3g is not enough. You might want to update the instructions for that as well.)

J38 commented 6 years ago

Hi, did you try this with the latest models jars from the GitHub front page?

https://github.com/stanfordnlp/CoreNLP

When I look at the most current models jars we have out, they have the new file paths for the regex rules files.

Make sure not to use the 3.9.1 jars if you are using the code from GitHub. Those jars are now out of date for the latest code. We are going to release 3.9.2 fairly soon!
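
One way to see which layout a given set of jars provides is to ask the classpath directly for the resource. The following is only a minimal sketch, not part of CoreNLP: the class name ResourceCheck and its helper methods are hypothetical, and the second path (the one without "gazetteers") is an assumption derived from the issue title rather than a path confirmed in this thread.

```java
import java.net.URL;

public class ResourceCheck {
    // Path quoted in the issue, including the "gazetteers" segment.
    static final String WITH_GAZETTEERS =
        "edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab";

    // Assumption: the same path with the "gazetteers" segment removed,
    // as the issue title proposes. Not confirmed anywhere in this thread.
    static final String WITHOUT_GAZETTEERS =
        "edu/stanford/nlp/models/kbp/english/regexner_caseless.tab";

    // Returns the URL of the resource on the system classpath, or null if absent.
    static URL locate(String path) {
        return ClassLoader.getSystemResource(path);
    }

    // True if some jar or directory on the classpath provides the resource.
    static boolean onClasspath(String path) {
        return locate(path) != null;
    }

    public static void main(String[] args) {
        // Report, for each candidate path, where it was found (if anywhere).
        for (String path : new String[] {WITH_GAZETTEERS, WITHOUT_GAZETTEERS}) {
            URL url = locate(path);
            System.out.println(path + " -> " + (url == null ? "not found" : url));
        }
    }
}
```

Running this with the same CLASSPATH used for the pipeline should show which of the two paths the models jars actually contain, and therefore whether the code and the models jars are from matching versions.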