tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License

BUG: Missing a word in the result of segmentation #124

Closed. Christiannov closed this issue 5 years ago

Christiannov commented 5 years ago

Hello, I found that pynlpir has a bug when segmenting the sentence "本报编辑部评出2000年国内十大新闻". The following code block shows that the word "新闻" is missing from the segmentation result.

import pynlpir

pynlpir.open()  # initialize the NLPIR API before segmenting
s = "本报编辑部评出2000年国内十大新闻"
print(pynlpir.segment(s))
# Output (the final word "新闻" is missing):
# [('本报', 'pronoun'), ('编辑部', 'noun'), ('评', 'verb'), ('出', 'verb'), ('2000年', 'time word'), ('国内', 'locative word'), ('十', 'numeral'), ('大', 'noun')]

I don't understand why this bug occurs. Do you have any idea? Thanks!

tsroten commented 5 years ago

Hello @Christiannov! PyNLPIR doesn't handle the segmentation itself; it uses NLPIR behind the scenes. I've found that NLPIR is pretty picky with regard to grammar. In many cases it appears to leave out words if the sentence is missing punctuation marks.

Try adding a period (full stop) at the end of your string and see if that helps.
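
A minimal sketch of that workaround, in case it helps. The helper names (ensure_terminal_punctuation, segment_with_workaround) are hypothetical and not part of PyNLPIR, and the choice of the Chinese full stop "。" as the default mark is an assumption; pass mark="." if that works better for your input. It assumes pynlpir.open() has already been called and that pynlpir.segment() returns (word, part-of-speech) tuples, as in the output above.

```python
# Hypothetical helpers, not part of PyNLPIR.
# Characters treated as sentence-ending punctuation (an assumed, non-exhaustive set).
TERMINAL_PUNCTUATION = "。！？；…．.!?;"


def ensure_terminal_punctuation(text, mark="。"):
    """Return text with `mark` appended unless it already ends in punctuation."""
    stripped = text.rstrip()
    if not stripped or stripped[-1] in TERMINAL_PUNCTUATION:
        return stripped
    return stripped + mark


def segment_with_workaround(text):
    """Segment `text` with PyNLPIR after applying the punctuation workaround.

    Assumes pynlpir.open() has already been called.
    """
    import pynlpir  # imported lazily so the helper above works without it

    original = text.rstrip()
    padded = ensure_terminal_punctuation(original)
    tokens = pynlpir.segment(padded)
    # Drop the token for the mark we added ourselves, if NLPIR returned one.
    if padded != original and tokens and tokens[-1][0] == padded[-1]:
        tokens = tokens[:-1]
    return tokens
```

Calling segment_with_workaround("本报编辑部评出2000年国内十大新闻") would then segment the padded string and strip the added mark from the result. Whether the missing word actually reappears depends on NLPIR itself.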

I'd also suggest reporting this issue to NLPIR. You could try their website/forum or their GitHub page.

Christiannov commented 5 years ago

I've tried adding a period at the end of the string and the bug no longer occurs.

In addition, I looked at the NLPIR repository, and similar issues have already been reported there. However, it seems that the contributor has not offered a better solution than adding a period. Fortunately, this bug is not a big problem for me at present.

Anyway, thank you very much @tsroten!