stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Using the corenlp client with different languages #12

Closed: shyamupa closed this issue 5 years ago

shyamupa commented 5 years ago

I want to access the Arabic and Chinese models from Java CoreNLP using the CoreNLP client. The README says that to use the Java models, I should put the models in the "distribution folder".

  1. Is this the same folder as CORENLP_HOME?

  2. How do I specify which language model to use when creating the CoreNLP client?

shyamupa commented 5 years ago

So I figured it out after a while.

  1. The folder is the same as CORENLP_HOME.

  2. This is how to specify a given language's properties (Chinese here):

from stanfordnlp.server import CoreNLPClient

properties = {
    # segmenter: Chinese word segmentation models
    "tokenize.language": "zh",
    "segment.model": "edu/stanford/nlp/models/segmenter/chinese/ctb.gz",
    "segment.sighanCorporaDict": "edu/stanford/nlp/models/segmenter/chinese",
    "segment.serDictionary": "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz",
    "segment.sighanPostProcessing": "true",
    # sentence split: break on Latin and full-width sentence-final punctuation
    "ssplit.boundaryTokenRegex": "[.。]|[!?！？]+",
    # pos
    "pos.model": "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger",
    # ner
    "ner.language": "chinese",
    "ner.model": "edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz",
    "ner.applyNumericClassifiers": "true",
    "ner.useSUTime": "false",
    # regexner: fine-grained NER rules from the Chinese gazetteers
    "ner.fine.regexner.mapping": "edu/stanford/nlp/models/kbp/chinese/gazetteers/cn_regexner_mapping.tab",
    "ner.fine.regexner.noDefaultOverwriteLabels": "CITY,COUNTRY,STATE_OR_PROVINCE",
}
annotators = ['tokenize', 'ssplit', 'pos', 'lemma', 'ner']

# set up the client
corenlp_client = CoreNLPClient(properties=properties,
                               annotators=annotators,
                               timeout=60000, memory='16G',
                               output_format="json",
                               be_quiet=False)
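And here is a minimal end-to-end sketch of driving the client above, assuming the usual CoreNLP JSON output structure (the CORENLP_HOME path and the sample sentence are placeholders):

import os

# the client locates the CoreNLP distribution via CORENLP_HOME; set it before
# the Java server is started (placeholder path below)
os.environ['CORENLP_HOME'] = '/path/to/stanford-corenlp-full-2018-10-05'

text = '達沃斯世界經濟論壇是每年全球政商界領袖聚在一起的年度盛事。'
# with output_format="json", annotate() returns parsed JSON (a dict)
ann = corenlp_client.annotate(text)

# print the segmented words of each sentence
for sentence in ann['sentences']:
    print([token['word'] for token in sentence['tokens']])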
1049451037 commented 5 years ago

However, I get an exception:

---
Input text ("The Davos World Economic Forum is an annual event where global political and business leaders gather."):

達沃斯世界經濟論壇是每年全球政商界領袖聚在一起的年度盛事。
---
starting up Java Stanford CoreNLP Server...
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP -     Threads: 5
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] INFO CoreNLP - [/0:0:0:0:0:0:0:1:40942] API call w/annotators tokenize,ssplit,pos,lemma,ner,depparse,coref
達沃斯世界經濟論壇是每年全球政商界領袖聚在一起的年度盛事。
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
java.lang.RuntimeException: java.io.IOException: Unable to open "edu/stanford/nlp/models/segmenter/chinese/ctb.gz" as class path, filename or URL
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.<init>(ChineseSegmenterAnnotator.java:104)
    at edu.stanford.nlp.pipeline.TokenizerAnnotator.<init>(TokenizerAnnotator.java:208)
    at edu.stanford.nlp.pipeline.TokenizerAnnotator.<init>(TokenizerAnnotator.java:166)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.tokenizer(AnnotatorImplementations.java:31)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$0(StanfordCoreNLP.java:518)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$30(StanfordCoreNLP.java:602)
    at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
    at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:251)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:192)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:188)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer.mkStanfordCoreNLP(StanfordCoreNLPServer.java:368)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer.access$800(StanfordCoreNLPServer.java:50)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer$CoreNLPHandler.handle(StanfordCoreNLPServer.java:855)
    at com.sun.net.httpserver.Filter$Chain.doFilter(jdk.httpserver@9-internal/Filter.java:77)
    at sun.net.httpserver.AuthFilter.doFilter(jdk.httpserver@9-internal/AuthFilter.java:82)
    at com.sun.net.httpserver.Filter$Chain.doFilter(jdk.httpserver@9-internal/Filter.java:80)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(jdk.httpserver@9-internal/ServerImpl.java:685)
    at com.sun.net.httpserver.Filter$Chain.doFilter(jdk.httpserver@9-internal/Filter.java:77)
    at sun.net.httpserver.ServerImpl$Exchange.run(jdk.httpserver@9-internal/ServerImpl.java:657)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@9-internal/ThreadPoolExecutor.java:1158)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@9-internal/ThreadPoolExecutor.java:632)
    at java.lang.Thread.run(java.base@9-internal/Thread.java:804)
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/segmenter/chinese/ctb.gz" as class path, filename or URL
    at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:480)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1503)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2939)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.<init>(ChineseSegmenterAnnotator.java:100)
    ... 23 more
Traceback (most recent call last):
  File "/home/qingsong/anaconda3/lib/python3.6/site-packages/stanfordnlp/server/client.py", line 193, in _request
    r.raise_for_status()
  File "/home/qingsong/anaconda3/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:9000/?properties=%7B%27tokenize.language%27%3A+%27zh%27%2C+%27segment.model%27%3A+%27edu%2Fstanford%2Fnlp%2Fmodels%2Fsegmenter%2Fchinese%2Fctb.gz%27%2C+%27segment.sighanCorporaDict%27%3A+%27edu%2Fstanford%2Fnlp%2Fmodels%2Fsegmenter%2Fchinese%27%2C+%27segment.serDictionary%27%3A+%27edu%2Fstanford%2Fnlp%2Fmodels%2Fsegmenter%2Fchinese%2Fdict-chris6.ser.gz%27%2C+%27segment.sighanPostProcessing%27%3A+%27true%27%2C+%27ssplit.boundaryTokenRegex%27%3A+%27%5B.%E3%80%82%5D%7C%5B%21%3F%EF%BC%81%EF%BC%9F%5D%2B%27%2C+%27pos.model%27%3A+%27edu%2Fstanford%2Fnlp%2Fmodels%2Fpos-tagger%2Fchinese-distsim%2Fchinese-distsim.tagger%27%2C+%27ner.language%27%3A+%27chinese%27%2C+%27ner.model%27%3A+%27edu%2Fstanford%2Fnlp%2Fmodels%2Fner%2Fchinese.misc.distsim.crf.ser.gz%27%2C+%27ner.applyNumericClassifiers%27%3A+%27true%27%2C+%27ner.useSUTime%27%3A+%27false%27%2C+%27ner.fine.regexner.mapping%27%3A+%27edu%2Fstanford%2Fnlp%2Fmodels%2Fkbp%2Fchinese%2Fgazetteers%2Fcn_regexner_mapping.tab%27%2C+%27ner.fine.regexner.noDefaultOverwriteLabels%27%3A+%27CITY%2CCOUNTRY%2CSTATE_OR_PROVINCE%27%2C+%27annotators%27%3A+%27tokenize%2Cssplit%2Cpos%2Clemma%2Cner%2Cdepparse%2Ccoref%27%2C+%27inputFormat%27%3A+%27text%27%2C+%27outputFormat%27%3A+%27json%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "corenlp.py", line 39, in <module>
    ann = client.annotate(text)
  File "/home/qingsong/anaconda3/lib/python3.6/site-packages/stanfordnlp/server/client.py", line 225, in annotate
    r = self._request(text.encode('utf-8'), properties)
  File "/home/qingsong/anaconda3/lib/python3.6/site-packages/stanfordnlp/server/client.py", line 199, in _request
    raise AnnotationException(r.text)
stanfordnlp.server.client.AnnotationException: java.lang.RuntimeException: java.io.IOException: Unable to open "edu/stanford/nlp/models/segmenter/chinese/ctb.gz" as class path, filename or URL
yuhaozhang commented 5 years ago

Hi @1049451037, can you provide more details on how you specified your annotators and properties?

Based on the error log, it does seem like you forgot to download the CoreNLP Chinese model files. Note that the default CoreNLP package only comes with English models, and you have to download models for other languages such as Chinese separately. Just put the Chinese jar file into the directory where your package is (i.e., $CORENLP_HOME) and try again.
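For anyone hitting the same "Unable to open ... as class path, filename or URL" error, a quick sanity check like this can confirm a Chinese models jar is actually sitting in $CORENLP_HOME (the jar filename pattern is only an example; exact names vary by CoreNLP release):

import glob
import os

corenlp_home = os.environ.get('CORENLP_HOME', '')

# models jars are named like stanford-chinese-corenlp-<date>-models.jar
chinese_jars = glob.glob(os.path.join(corenlp_home, '*chinese*models*.jar'))
if not chinese_jars:
    raise FileNotFoundError(
        'No Chinese models jar in %r; download it from the CoreNLP download '
        'page and put it next to the main CoreNLP jars.' % corenlp_home)
print('Found:', chinese_jars)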

yuhaozhang commented 5 years ago

Hi @shyamupa, sorry that your initial issue was somehow missed. Glad you figured things out! Meanwhile, we have improved the documentation a bit on how to use different options with the Python CoreNLP client (link here). However, for the detailed properties you do need to look at the corresponding annotators on the CoreNLP website. Let us know if you have more issues!

1049451037 commented 5 years ago

@yuhaozhang Thank you, I downloaded the model and it works well now.

1049451037 commented 5 years ago

@bifeng, have you set up CORENLP_HOME following this link?

feng-1985 commented 5 years ago

Thanks for the response! The problem was a broken Chinese jar file. After updating the file, it works!

lfzhagn commented 5 years ago

@shyamupa Hi, could you please tell me how to find the keys and values for the properties dictionary? For example, "segment.model": "edu/stanford/nlp/models/segmenter/chinese/ctb.gz". Is there any documentation for these parameters? In other words, how can I know which keys the dictionary accepts?

shyamupa commented 5 years ago

Look in the properties file (named something like StanfordCoreNLP-chinese.properties). It usually lives inside the models jar, so you may need to unpack it. It contains the keys and values that make up the dictionary above:

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true
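If you'd rather not unpack the whole jar, a short sketch like this can print the file directly (the jar filename is a placeholder, and I'm assuming the properties file sits at the root of the models jar, which is where it normally lives):

import zipfile

# placeholder path: point this at your actual Chinese models jar
jar_path = 'stanford-chinese-corenlp-2018-10-05-models.jar'

with zipfile.ZipFile(jar_path) as jar:
    # if unsure where the file is, inspect the contents: print(jar.namelist())
    with jar.open('StanfordCoreNLP-chinese.properties') as f:
        print(f.read().decode('utf-8'))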
lfzhagn commented 5 years ago

@shyamupa Wow! Thank you very much for your fantastic answer and fast reply! 👍