stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Issue with the output for Simplified Chinese #55

Closed feng-1985 closed 5 years ago

feng-1985 commented 5 years ago
from stanfordnlp.server import CoreNLPClient

text = '这是个最好的时代,也是一个最坏的时代!'  

properties = {
        # segment
        "tokenize.language": "zh",
        "segment.model": "edu/stanford/nlp/models/segmenter/chinese/ctb.gz",
         ...

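with CoreNLPClient(properties=properties, annotators=annotators, timeout=60000, threads=5, memory='4G', be_quiet=False) as client:
    # Assumed step, as in the stanfordnlp README: annotate the text and take the first sentence
    ann = client.annotate(text)
    sentence = ann.sentence[0]
    print('---')
    print('first token of first sentence')
    token = sentence.token[0]
    print(token)
    ...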

The output:

first token of first sentence
word: "\350\277\231"
pos: "PN"
value: "\350\277\231"
originalText: "\350\277\231"
ner: "O"
lemma: "\350\277\231"
beginChar: 0
endChar: 1

yuhaozhang commented 5 years ago

If you print out each value individually, the result should look right. Try the following:

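with CoreNLPClient(properties=properties, annotators=annotators, timeout=60000, threads=5, memory='4G', be_quiet=False) as client:
    # Assumed step, as in the stanfordnlp README: annotate the text and take the first sentence
    ann = client.annotate(text)
    sentence = ann.sentence[0]
    print('---')
    print('first token of first sentence')
    token = sentence.token[0]
    # Print the string fields one by one instead of the whole token message
    print(token.word)
    print(token.originalText)
    print(token.lemma)
    ...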
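
The escaped values in the original output are most likely not corrupted text. "\350\277\231" looks like the three UTF-8 bytes of the character 这, written as octal escapes, which is how a whole protobuf message is typically rendered when printed. A minimal sketch to check this without running CoreNLP, where `decode_octal_escapes` is a hypothetical helper written for this illustration and not part of stanfordnlp:

```python
# The escaped string exactly as it appears in the printed token.
escaped = r"\350\277\231"


def decode_octal_escapes(s: str) -> str:
    """Decode a string made only of \\ooo octal escapes into UTF-8 text.

    Hypothetical helper for illustration; it does not handle strings that
    mix escapes with ordinary characters.
    """
    # Split on the backslash and parse each piece as a base-8 byte value.
    raw = bytes(int(part, 8) for part in s.split("\\") if part)
    return raw.decode("utf-8")


word = decode_octal_escapes(escaped)
print(word)
print(word == "这")
```

If the check holds, the token fields already contain the correct Chinese text, and only the display of the whole message is escaped.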
feng-1985 commented 5 years ago

Thanks for the quick response! You are right!