stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Unable to open "/home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz" #1458

Closed: WXCMYDEARMELAN closed this issue 1 month ago

WXCMYDEARMELAN commented 1 month ago

Exception: (screenshot; reproduced as text below)

I never specified the path /home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz anywhere in my code. (screenshot)

AngledLuffa commented 1 month ago

Not sure how to help here considering I don't know what you downloaded or ran

Please send text instead of images

WXCMYDEARMELAN commented 1 month ago

> Not sure how to help here considering I don't know what you downloaded or ran
>
> Please send text instead of images

Please check the code:

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");
    props.setProperty("tokenize.language", "zh");
    props.setProperty("segment.model", "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
    props.setProperty("segment.dictionary", "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Log:

    11:39:53.747 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
    11:40:02.958 [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [9.2 sec].
    11:40:02.988 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
    11:40:03.202 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
    11:40:03.202 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    11:40:03.548 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
    11:40:03.548 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
    11:40:03.548 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - /home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz
    11:40:03.551 [main] ERROR edu.stanford.nlp.wordseg.ChineseDictionary - java.io.IOException: Unable to open "/home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz" as class path, filename or URL

Exception:

    Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Unable to open "/home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz" as class path, filename or URL

Maven:

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>4.2.2</version>
        <classifier>models-chinese</classifier>
        <exclusions>
            <exclusion>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>4.2.2</version>
        <exclusions>
            <exclusion>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

AngledLuffa commented 1 month ago

There's a much newer version available. Can I recommend upgrading?

https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp

WXCMYDEARMELAN commented 1 month ago

> There's a much newer version available. Can I recommend upgrading?
>
> https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp

Yeah, I upgraded the jar to 4.5.5, but the result is the same:

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>4.5.5</version>
        <classifier>models-chinese</classifier>
    </dependency>

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>4.5.5</version>
        <exclusions>
            <exclusion>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    14:29:29.439 [main] DEBUG edu.stanford.nlp.pipeline.StanfordCoreNLP - ssplit is now included as part of the tokenize annotator by default
    14:29:29.442 [main] DEBUG edu.stanford.nlp.pipeline.StanfordCoreNLP - Updating annotators from tokenize, ssplit to tokenize
    14:29:29.456 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
    14:29:37.666 [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [8.2 sec].
    14:29:37.704 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
    14:29:37.704 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    14:29:37.909 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
    14:29:37.909 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
    14:29:37.909 [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - /home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz
    14:29:37.911 [main] ERROR edu.stanford.nlp.wordseg.ChineseDictionary - java.io.IOException: Unable to open "/home/john/extern_data/corenlp-segmenter/dict-chris6.ser.gz" as class path, filename or URL
        edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:501)
        edu.stanford.nlp.io.IOUtils.readObjectFromURLOrClasspathOrFileSystem(IOUtils.java:309)
        edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)

AngledLuffa commented 1 month ago

Okay, I understand the problem. The model was built with various defaults, including paths under /home/john. The Chinese pipeline uses several flags to point those at the new locations of the files in the jar resources we distribute. You can see the paths for the segmenter model in the StanfordCoreNLP-chinese.properties file, copied here for your convenience:

    tokenize.language = zh
    segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
    segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
    segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    segment.sighanPostProcessing = true

    ssplit.boundaryTokenRegex = [.。]|[!?!?]+

If you're creating a Pipeline by hand using Properties, instead of reusing that properties file, you'll want to set those properties as well as the ones you've already set above.
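
To make that concrete, here is a minimal sketch of a hand-built pipeline that sets all of the properties above. The demo class name and the sample sentence are illustrative additions, not part of the original report:

    import java.util.Properties;

    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    // Illustrative demo class; only the property keys and values come from
    // the StanfordCoreNLP-chinese.properties snippet quoted above.
    public class ChineseSegmenterDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit");
            props.setProperty("tokenize.language", "zh");
            props.setProperty("segment.model", "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
            // Note the key is segment.serDictionary, not the segment.dictionary
            // key used in the snippet earlier in this thread.
            props.setProperty("segment.serDictionary", "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
            props.setProperty("segment.sighanCorporaDict", "edu/stanford/nlp/models/segmenter/chinese");
            props.setProperty("segment.sighanPostProcessing", "true");
            props.setProperty("ssplit.boundaryTokenRegex", "[.。]|[!?!?]+");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // Quick smoke test: segment a short Chinese sentence and print the
            // tokens (the sample text is an assumption for the demo).
            CoreDocument doc = new CoreDocument("这是一个测试句子。");
            pipeline.annotate(doc);
            doc.tokens().forEach(token -> System.out.println(token.word()));
        }
    }

Alternatively, if you don't want to copy the settings by hand, you can build the pipeline straight from the bundled file with `new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")`, which loads it from the models jar on the classpath.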

AngledLuffa commented 1 month ago

P.S. That should also work for the older 4.2.2, assuming there's a reason you wanted to use that version, but I do recommend updating. Every once in a while we fix a bug relevant to one of those models.

WXCMYDEARMELAN commented 1 month ago

> Okay, I understand the problem. The model was built with various defaults, including paths under /home/john. [...] If you're creating a Pipeline by hand using Properties, instead of reusing that properties file, you'll want to set those properties as well as the ones you've already set above.

That's helpful, thank you very much.