tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License

pynlpir.segment() hangs on English word with newline #33

Closed: fayeshine closed this issue 8 years ago

fayeshine commented 9 years ago

If you run pynlpir.segment('E\n'), the program hangs. This happens when the input contains an English word and '\n'. The bug is easy to trigger; please fix it. Thanks.

tsroten commented 9 years ago

@fayeshine Thank you for reporting this bug. This seems to be a problem with NLPIR itself, not PyNLPIR. Here are some tests:

>>> nlpir.ParagraphProcess('我们\n我们'.encode('utf8'), False)
b'\xe6\x88\x91\xe4\xbb\xac \n\xe6\x88\x91\xe4\xbb\xac '
>>> nlpir.ParagraphProcess('我们\n'.encode('utf8'), False)
b'\xe6\x88\x91\xe4\xbb\xac \n'
>>> nlpir.ParagraphProcess('我们\ntest'.encode('utf8'), False)
b'\xe6\x88\x91\xe4\xbb\xac \ntest '
>>> nlpir.ParagraphProcess('test\n我们'.encode('utf8'), False)
b'test \n\xe6\x88\x91\xe4\xbb\xac '
>>> nlpir.ParagraphProcess('test\n我们\n'.encode('utf8'), False)
b'test \n\xe6\x88\x91\xe4\xbb\xac \n'
>>> nlpir.ParagraphProcess('test\n'.encode('utf8'), False)
[...NLPIR hangs...]

So, an easy solution is to strip any newlines that appear at the end of the input string before calling nlpir.ParagraphProcess. The problem does not seem to affect nlpir.GetKeyWords.

This would be a simple addition to pynlpir.segment. We'll leave nlpir.ParagraphProcess alone. Anyone willing to submit a pull request for this?
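
For anyone who needs a workaround before a fix lands, here is a minimal sketch of the idea above: strip trailing newlines before the text reaches the segmenter. The names strip_trailing_newlines and safe_segment are hypothetical and not part of PyNLPIR, and this is not necessarily how the eventual fix is implemented. The segmenter is passed in as a callable so the sketch runs without NLPIR installed.

```python
def strip_trailing_newlines(text):
    """Remove newline characters from the end of text.

    Only trailing '\\r' and '\\n' are removed; newlines inside the
    text and other trailing whitespace are left untouched.
    """
    return text.rstrip('\r\n')


def safe_segment(text, segment_func):
    """Call segment_func on text with trailing newlines stripped.

    segment_func is any callable taking a string, for example
    pynlpir.segment (assumed here, not imported).
    """
    return segment_func(strip_trailing_newlines(text))


if __name__ == '__main__':
    # str.split stands in for a real segmenter in this demo.
    print(safe_segment('test\n', str.split))
```

With PyNLPIR installed and opened, the call would presumably look like safe_segment('test\n', pynlpir.segment). Newlines in the middle of the input are kept, since the tests above show those are handled correctly.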

tsroten commented 8 years ago

@fayeshine Okay, I've fixed this in the latest develop branch. I'll publish a release to PyPI shortly. Thanks again!