tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License

Having problem when trying to segment a string '[ / ]' #96

Open stickjitb opened 6 years ago

stickjitb commented 6 years ago

When I try to segment a string containing the pattern '[ / ]', an UnboundLocalError is raised.

No error occurs when segmenting '[/]', '/', or '[ X ]', where X stands for any symbol other than /.

Here are the outputs:

  File "NER_train.py", line 8, in <module>
    s = pynlpir.segment('[ / ]')
  File "D:\Python\lib\site-packages\pynlpir\__init__.py", line 248, in segment
    pos_name = _get_pos_name(token[1], pos_names, pos_english)
  File "D:\Python\lib\site-packages\pynlpir\__init__.py", line 193, in _get_pos_name
    pos_name = pos_map.get_pos_name(code, name, english)
  File "D:\Python\lib\site-packages\pynlpir\pos_map.py", line 190, in get_pos_name
    return _get_pos_name(code, name, english)
  File "D:\Python\lib\site-packages\pynlpir\pos_map.py", line 151, in _get_pos_name
    pos = (pos_entry[1 if english else 0], )
UnboundLocalError: local variable 'pos_entry' referenced before assignment

tsroten commented 6 years ago

Hello @stickjitb! Unfortunately, NLPIR (the library we use to segment text) uses / as the separator between each token and its part-of-speech code in the output it returns. Here is a typical example of what NLPIR returns for hi there:

hi/o there/rzs

You'll notice that it puts spaces between tokens and uses a / to separate each token from its part of speech.

In your example, this is returned:

[ / ]/xm

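That breaks the format that the NLPIR project has decided to use for their token separation: the token itself contains both spaces and a /, so it can no longer be told apart from the separators.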
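To make the ambiguity concrete, here is a minimal sketch of a naive parser for this output format. It assumes the raw string is split on spaces and each piece is then split on its last /; the function name naive_parse is hypothetical, and this is an illustration rather than PyNLPIR's actual parsing code.

```python
def naive_parse(raw):
    """Split NLPIR-style output on spaces, then split each piece on its last '/'.

    Hypothetical helper for illustration only; not part of PyNLPIR.
    """
    pairs = []
    for piece in raw.split(" "):
        token, sep, pos = piece.rpartition("/")
        if not sep:
            # No '/' in this piece, so there is no part-of-speech code at all.
            token, pos = piece, None
        pairs.append((token, pos))
    return pairs


print(naive_parse("hi/o there/rzs"))  # two well-formed (token, code) pairs
print(naive_parse("[ / ]/xm"))        # one token wrongly broken into three pieces
```

For the second input, the lone / piece ends up with an empty token and an empty part-of-speech code. An empty or missing code is plausibly what the part-of-speech lookup in the traceback cannot resolve, though that is an inference from the error message and not something confirmed here.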

There really isn't anything we can do on the PyNLPIR side about this. You might try talking to the NLPIR team on their GitHub project or website: https://github.com/NLPIR-team/NLPIR http://ictclas.nlpir.org/

tsroten commented 6 years ago

We could get fancy in how we process the tokens from NLPIR by using look-ahead assertions in a regular expression (like only splitting on / if it has [a-z] immediately following it), but this doesn't seem like a common enough problem or a typical enough text sample to make that worthwhile.
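A rough sketch of that look-ahead idea follows. It rests on assumptions that are not confirmed by NLPIR: that every part-of-speech code is a lowercase letter followed by lowercase letters or digits, and that each code is followed by whitespace or the end of the string. The names TOKEN_RE and parse_with_lookahead are hypothetical and not part of PyNLPIR.

```python
import re

# Assumption: a part-of-speech code looks like "o", "rzs" or "xm", and is
# always followed by whitespace or the end of the string (the look-ahead).
TOKEN_RE = re.compile(r"\s*(.+?)/([a-z][a-z0-9]*)(?=\s|$)")


def parse_with_lookahead(raw):
    """Return (token, pos_code) pairs, treating '/' as a separator only when
    it is immediately followed by something that looks like a code."""
    return TOKEN_RE.findall(raw)


print(parse_with_lookahead("hi/o there/rzs"))
print(parse_with_lookahead("[ / ]/xm"))
```

With this pattern the whole of [ / ] stays together as one token tagged xm. It is still a heuristic: source text that itself contains a / followed by lowercase letters and then a space would remain ambiguous.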

stickjitb commented 6 years ago

@tsroten OK, I get it. If that is the case, I don't think the NLPIR team will have any solution because you always need a pattern as a separator. Anyway I've reported this issue to them.

As you said, the pattern '[ / ]' is not typical enough, so if no better solution is proposed, I'll just handle text containing it with some ad hoc measures.