tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License

Wrong segmentation result after adding words with "AddUserWord" #83

zhubinjun closed this issue 6 years ago

zhubinjun commented 7 years ago

After adding user words, I get incorrect segmentation results:

Test Code:

import pynlpir
pynlpir.open()
print(pynlpir.segment('双肺'))
pynlpir.nlpir.AddUserWord('双肺'.encode('utf8'))
print(pynlpir.segment('双肺'))
print(pynlpir.segment('双肺'*2))
print(pynlpir.segment('立普妥'))
pynlpir.nlpir.AddUserWord('立普妥'.encode('utf8'))
print(pynlpir.segment('立普妥'))
print(pynlpir.segment('立普妥'*2))
print(pynlpir.segment('南太平洋'))
pynlpir.nlpir.AddUserWord('南太平洋'.encode('utf8'))
print(pynlpir.segment('南太平洋'))
print(pynlpir.segment('南太平洋'*2))

Output (Python 3, Windows/Ubuntu):

[('双', 'numeral'), ('肺', 'noun')]
[('双', 'noun')]
[('双肺', 'noun'), ('双', 'noun')]
[('立', 'verb'), ('普', 'adjective'), ('妥', 'adjective')]
[('立', 'noun')]
[('立普妥', 'noun'), ('立', 'noun')]
[('南', 'distinguishing word'), ('太平洋', 'noun')]
[('南', 'noun')]
[('南太平洋', 'noun'), ('南', 'noun')]
tsroten commented 7 years ago

@zhubinjun Thanks for using PyNLPIR!

PyNLPIR does not actually segment the text itself; it simply returns whatever NLPIR outputs. I'd recommend asking the NLPIR developers about this.

My guess is that NLPIR works more reliably with actual sentences, not one-word strings. For example:

>>> print(pynlpir.segment('随着20世纪70年代环孢素A的问世和移植技术的进步,1981年美国斯坦福大学医院首先获得心肺联合移植的成功;1983年和1986年加拿大多伦多肺移植组又相继成功地施行了单肺移植和双肺移植,开创了肺移植的新纪元。'))
[... ('双', 'numeral'), ('肺', 'noun'), ...]
>>> pynlpir.nlpir.AddUserWord('双肺'.encode('utf8'))
>>> print(pynlpir.segment('随着20世纪70年代环孢素A的问世和移植技术的进步,1981年美国斯坦福大学医院首先获得心肺联合移植的成功;1983年和1986年加拿大多伦多肺移植组又相继成功地施行了单肺移植和双肺移植,开创了肺移植的新纪元。'))
[... ('双肺', 'noun') ...]

Text from https://baike.baidu.com/item/%E5%8F%8C%E8%82%BA%E7%A7%BB%E6%A4%8D
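
To check a user word in sentence context rather than by eye, a small helper can report whether the word comes back as a single token. This is only a minimal sketch: it assumes the segmenter is any callable returning (token, part of speech) pairs, as the pynlpir.segment output above does. The names is_single_token and fake_segment are hypothetical, and fake_segment is a stand-in so the snippet runs without NLPIR installed; it does not reflect how NLPIR segments text.

```python
def is_single_token(segment, text, word):
    """Return True if `word` appears as one token in segment(text)."""
    tokens = [token for token, _pos in segment(text)]
    return word in tokens


def fake_segment(text):
    """Hypothetical stand-in for pynlpir.segment, for demonstration only.

    It splits `text` around one hard-coded "dictionary" word and tags
    every piece as 'noun'; real NLPIR output will differ.
    """
    word = '双肺'
    pairs = []
    for i, part in enumerate(text.split(word)):
        if i > 0:
            # Each split boundary corresponds to one occurrence of the word.
            pairs.append((word, 'noun'))
        if part:
            pairs.append((part, 'noun'))
    return pairs


sentence = '成功地施行了单肺移植和双肺移植'
print(is_single_token(fake_segment, sentence, '双肺'))  # True
print(is_single_token(fake_segment, sentence, '单肺'))  # False
```

With the real library, the same helper could be called as is_single_token(pynlpir.segment, sentence, '双肺') after pynlpir.open() and the AddUserWord call shown above.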