tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License

I followed the tutorial exactly, but the result... #85

Closed: evanhasnoclue closed this issue 6 years ago

evanhasnoclue commented 7 years ago

Here is what I entered and what I got. I tried all three encoding types, but none of them worked. I'm really confused about the encoding part. Please help me out!

import pynlpir
pynlpir.open()
s= 'NLPIR分词系统前身为2000年发布的ICTCLAS词法分析系统,从2009年开始,为了和以前工作进行大的区隔,并推广NLPIR自然语 言处理与信息检索共享平台,调整命名为NLPIR分词系统。'
pynlpir.segment(s)
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python27\lib\site-packages\pynlpir\__init__.py", line 232, in segment
    s = _decode(s)
  File "C:\Python27\lib\site-packages\pynlpir\__init__.py", line 164, in _decode
    return s if isinstance(s, unicode) else s.decode(encoding, errors)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb7 in position 5: invalid start byte

tsroten commented 7 years ago

Your terminal probably uses a default encoding other than UTF-8, so the byte string you typed can't be decoded as UTF-8. Try putting a u in front of your string to make it a Unicode literal:

s = u'NLPIR分词系统前身为2000年发布的ICTCLAS词法分析系统,从2009年开始,为了和以前工作进行大的区隔,并推广NLPIR自然语 言处理与信息检索共享平台,调整命名为NLPIR分词系统。'
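
For context, here is a minimal sketch of why the u prefix likely helps. It assumes the console was using GBK (cp936), which is a guess based on the traceback: byte 0xb7 at position 5 matches the first GBK byte of 分, right after the five ASCII bytes of NLPIR. The variable names are illustrative and nothing here calls pynlpir.

```python
# -*- coding: utf-8 -*-
# Assumption: the console encodes typed text as GBK (cp936).
text = u'NLPIR分词系统'

# Roughly what a plain '...' literal holds in Python 2 on such a console:
# raw GBK bytes rather than Unicode text.
gbk_bytes = text.encode('gbk')

# Decoding those bytes as UTF-8 fails the same way as in the traceback.
try:
    gbk_bytes.decode('utf-8')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# Decoding with the codec that actually produced the bytes round-trips.
decoded = gbk_bytes.decode('gbk')

print(failure)
print(decoded == text)
```

With a u'...' literal the interpreter does this decoding itself using the console's encoding, so the string reaches pynlpir.segment() as Unicode and no UTF-8 decode of raw bytes is attempted.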

evanhasnoclue commented 7 years ago

@tsroten Thanks, it works! But my result is made up entirely of escape codes. How can I get it to display Chinese characters? I've tried s.encode('utf8') and s.encode('gbk') many times, but neither worked. Thanks for helping me out!

pynlpir.segment(s)
[(u'NLPIR', u'noun'), (u'\u5206\u8bcd', u'verb'), (u'\u7cfb\u7edf', u'noun'), (u'\u524d\u8eab', u'noun'), (u'\u4e3a', u'preposition'), (u'2000\u5e74', u'time word'), (u'\u53d1\u5e03', u'verb'), (u'\u7684', u'particle'), (u'ICTCLAS', u'noun'), (u'\u8bcd\u6cd5', u'noun'), (u'\u5206\u6790', u'verb'), (u'\u7cfb\u7edf', u'noun'), (u'\uff0c', u'punctuation mark'), (u'\u4ece', u'preposition'), (u'2009\u5e74', u'time word'), (u'\u5f00\u59cb', u'verb'), (u'\uff0c', u'punctuation mark'), (u'\u4e3a\u4e86', u'preposition'), (u'\u548c', u'conjunction'), (u'\u4ee5\u524d', u'noun of locality'), (u'\u5de5\u4f5c', u'verb'), (u'\u8fdb\u884c', u'verb'), (u'\u5927', u'adjective'), (u'\u7684', u'particle'), (u'\u533a', u'noun'), (u'\u9694', u'verb'), (u'\uff0c', u'punctuation mark'), (u'\u5e76', u'conjunction'), (u'\u63a8\u5e7f', u'verb'), (u'NLPIR', u'noun'), (u'\u81ea\u7136', u'noun'), (u'\u8bed', u'noun'), (u' ', None), (u'\u8a00', u'verb'), (u'\u5904\u7406', u'verb'), (u'\u4e0e', u'preposition'), (u'\u4fe1\u606f', u'noun'), (u'\u68c0\u7d22', u'verb'), (u'\u5171\u4eab', u'verb'), (u'\u5e73\u53f0', u'noun'), (u'\uff0c', u'punctuation mark'), (u'\u8c03\u6574', u'verb'), (u'\u547d\u540d', u'verb'), (u'\u4e3a', u'verb'), (u'NLPIR', u'noun'), (u'\u5206\u8bcd', u'verb'), (u'\u7cfb\u7edf', u'noun'), (u'\u3002', u'punctuation mark')]

tsroten commented 7 years ago

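The strings in the result are Unicode. When the Python 2 interpreter echoes a list, it shows each item's repr, which escapes non-ASCII characters as \uXXXX, so nothing is wrong with the data itself. Print the strings individually if you want to read them: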

segments = pynlpir.segment(s)
for word, part_of_speech in segments:
    print(word)
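
If you also want the part-of-speech tags next to each word, something along these lines may help. It is only a sketch: the sample list is copied from the first few tuples of the output above (plus the whitespace token), and format_segments is a hypothetical helper, not part of pynlpir.

```python
# -*- coding: utf-8 -*-
# Sample tuples copied from the output above; a real list would come
# from pynlpir.segment(s).
segments = [
    (u'NLPIR', u'noun'),
    (u'\u5206\u8bcd', u'verb'),
    (u'\u7cfb\u7edf', u'noun'),
    (u' ', None),
]


def format_segments(segments):
    """Hypothetical helper: join tokens as 'word/tag', skipping blanks."""
    parts = []
    for word, pos in segments:
        if not word.strip():
            continue  # whitespace-only tokens carry no tag
        parts.append(u'{0}/{1}'.format(word, pos))
    return u' '.join(parts)


# The \uXXXX form is only a display form: the string itself holds
# two characters, not twelve.
word = segments[1][0]
escaped = word.encode('unicode_escape')

line = format_segments(segments)
print(len(word))
print(line)
```

Calling s.encode('utf8') or s.encode('gbk') goes in the opposite direction, from Unicode text back to bytes, which is why it did not make the list echo any more readable.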