tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License

pynlpir.segment returns inconsistent results for certain inputs when the inputs are given more than once #55

Closed kensk8er closed 8 years ago

kensk8er commented 8 years ago

The pynlpir.segment method seems to return inconsistent results for certain inputs. By "inconsistent results" I mean that the method returns different lists for the same input. In my environment you can reproduce the issue with the steps below in a REPL (e.g. IPython); a self-contained script version follows the list.

  1. import pynlpir
  2. pynlpir.open()
  3. text = u'父子俩产生矛盾 老父雇凶杀子・广东新闻・珠江三角洲・南方新闻网 黄某华是河源市和平县彭寨镇农民黄某坤的大儿子,父子俩产生矛盾,黄某坤决定"教训一下大儿子",遂花1600元叫小儿子黄某文雇来凶手痛打黄某华。'
  4. len(pynlpir.segment(text, pos_tagging=False)) # this returns 60
  5. len(pynlpir.segment(text, pos_tagging=True)) # this still returns 60
  6. len(pynlpir.segment(text, pos_tagging=False)) # this now returns 57, not 60!
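Here are the same steps as a self-contained script (the counts in the comments are what I see in my environment; they may differ elsewhere):

```python
# -*- coding: utf-8 -*-
# Self-contained reproduction of the inconsistency described above.
import pynlpir

pynlpir.open()

text = (u'父子俩产生矛盾 老父雇凶杀子・广东新闻・珠江三角洲・南方新闻网 '
        u'黄某华是河源市和平县彭寨镇农民黄某坤的大儿子,父子俩产生矛盾,'
        u'黄某坤决定"教训一下大儿子",遂花1600元叫小儿子黄某文雇来凶手痛打黄某华。')

first = pynlpir.segment(text, pos_tagging=False)   # 60 tokens
tagged = pynlpir.segment(text, pos_tagging=True)   # 60 (token, POS) pairs
second = pynlpir.segment(text, pos_tagging=False)  # 57 tokens, differs from `first`

print(len(first), len(tagged), len(second))
print(first == second)  # False when the inconsistency is triggered

pynlpir.close()
```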

When the length of the returned list is 60, the tokenization is as follows. [u'父子', u'俩', u'产生', u'矛盾', u' ', u'老父', u'雇', u'凶杀', u'子', u'・广东', u'新闻', u'・珠江', u'三角洲', u'・南方', u'新闻网', u' ', u'黄某', u'华', u'是', u'河源市', u'和平县', u'彭', u'寨', u'镇', u'农民', u'黄某', u'坤', u'的', u'大儿子', u',', u'父子', u'俩', u'产生', u'矛盾', u',', u'黄某', u'坤', u'决定', u'"', u'教训', u'一下', u'大儿子', u'"', u',', u'遂', u'花', u'1600', u'元', u'叫', u'小', u'儿子', u'黄某', u'文', u'雇', u'来', u'凶手', u'痛打', u'黄某', u'华', u'。',]

When the length of the returned list is 57, the tokenization is as follows. [u'父子', u'俩', u'产生', u'矛盾', u' ', u'老父', u'雇', u'凶杀', u'子', u'・广东', u'新闻', u'・珠江', u'三角洲', u'・南方', u'新闻网', u' ', u'黄某', u'华', u'是', u'河源市', u'和平县', u'彭', u'寨', u'镇', u'农民', u'黄某坤', u'的', u'大儿子', u',', u'父子', u'俩', u'产生', u'矛盾', u',', u'黄某坤', u'决定', u'"', u'教训', u'一下', u'大儿子', u'"', u',', u'遂', u'花', u'1600', u'元', u'叫', u'小', u'儿子', u'黄某文', u'雇', u'来', u'凶手', u'痛打', u'黄某', u'华', u'。',]

You can see that pynlpir.segment returned two different tokenizations here: in the 60-token result the names 黄某坤 and 黄某文 are split into 黄某 + 坤 and 黄某 + 文, while in the 57-token result they are kept as single tokens.
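In case it's useful, here's a quick way to print exactly where the two results diverge (standard-library difflib only; `first` and `second` are the 60- and 57-token lists from the script above):

```python
import difflib

# Walk the opcodes from SequenceMatcher and show only the spans that differ.
matcher = difflib.SequenceMatcher(a=first, b=second)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != 'equal':
        print(op, first[i1:i2], '->', second[j1:j2])

# For the lists above this prints lines like:
#   replace [u'黄某', u'坤'] -> [u'黄某坤']
```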

This only happens for certain text inputs; the text above is just one example. I've hit 10 or so inconsistent tokenizations while running pynlpir.segment over ~2,000 sentences.

I've confirmed that the issue occurs in at least the following environments:

- OS: OS X 10.11, Ubuntu 14.04
- Python: 2.7.11

tsroten commented 8 years ago

PyNLPIR is a thin wrapper, so it simply reflects NLPIR's behavior. Have you tried running this through NLPIR directly? My guess is that you'd get the same result.
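If it helps, something like the following should exercise the underlying library more directly through PyNLPIR's low-level ctypes bindings (a rough sketch: it assumes pynlpir.nlpir exposes NLPIR_ParagraphProcess as ParagraphProcess(bytes, pos_tagging) and that NLPIR was opened with UTF-8 encoding):

```python
# -*- coding: utf-8 -*-
# Rough sketch: call NLPIR's paragraph processing directly, bypassing
# pynlpir.segment's post-processing, to see whether the raw output also drifts.
import pynlpir
from pynlpir import nlpir  # low-level ctypes bindings shipped with PyNLPIR

pynlpir.open(encoding='utf-8')  # initialize NLPIR expecting UTF-8 input

# Same sentence as in the report above.
text = (u'父子俩产生矛盾 老父雇凶杀子・广东新闻・珠江三角洲・南方新闻网 '
        u'黄某华是河源市和平县彭寨镇农民黄某坤的大儿子,父子俩产生矛盾,'
        u'黄某坤决定"教训一下大儿子",遂花1600元叫小儿子黄某文雇来凶手痛打黄某华。')

raw = nlpir.ParagraphProcess(text.encode('utf-8'), False)  # False = no POS tags
print(raw.decode('utf-8'))  # space-separated segmentation straight from NLPIR

pynlpir.close()
```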

kensk8er commented 8 years ago

Thanks for the reply! It was a month ago so I don't remember clearly, but I found a way to work around this issue (I don't remember whether it's a bug in NLPIR itself or not). The fix is in my private fork (not on GitHub), so let me know if you want to integrate it at some point. The commits are a bit messy and also contain some changes you probably don't want, so I'd need to clean them up first if you want to integrate them.

tsroten commented 8 years ago

@kensk8er I'd definitely be interested in taking a look. I'd be happy to look at it as-is, as well :)