UnicodeDecodeError in run_tagger.py with English tagger.

tunystom commented 10 years ago

I encountered the following issue while testing the example code for the Python bindings:

echo "manifestation of the people’s ‘mental enslavement’" | python run_tagger.py english-morphium-wsj-140407.tagger

The following error pops up:

Traceback (most recent call last):
  File "run_tagger.py", line 81, in <module>
    encode_entities(lemma.lemma),
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data

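The exception is raised on the word people, which is followed by ’ (the right single quotation mark, U+2019). It seems that the string people’s is truncated in the middle of the multibyte UTF-8 sequence for the quote, which is \xe2\x80\x99: the six bytes of people occupy positions 0-5, so positions 6-7 in the error message are \xe2\x80 with the final \x99 missing.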
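The truncation described above can be reproduced without the tagger. The following is a minimal sketch in plain Python 3 (the report itself used Python 2.7); the variable names and the `try_decode` helper are illustrative only and are not part of run_tagger.py or MorphoDiTa. It only assumes that the token boundary was placed one byte too early.

```python
# -*- coding: utf-8 -*-
# Sketch: cutting a UTF-8 byte string inside a multibyte sequence
# makes decoding fail, as in the reported traceback.

word = u"people\u2019s"          # U+2019 is the right single quotation mark
encoded = word.encode("utf-8")   # 6 bytes + 3 bytes (\xe2\x80\x99) + 1 byte

# Hypothetical off-by-one token boundary: keep only the first 8 bytes,
# which drops the last byte (\x99) of the quotation mark.
truncated = encoded[:8]


def try_decode(data):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err


text, error = try_decode(truncated)
if error is not None:
    # error.start and error.end delimit the undecodable bytes
    print("decode failed at bytes %d-%d" % (error.start, error.end - 1))
```

Decoding the full `encoded` value succeeds, which is consistent with the input being valid UTF-8 and the damage happening at the token boundary.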

The taggers for Czech seem to work fine, at least they do not fail on the quotes.

I am using Python 2.7 and built the code and bindings from source on Ubuntu 12.04 with the proper versions of g++/swig.

foxik commented 10 years ago

Confirmed, the problem is in the MorphoDiTa library itself, in the English tokenizer. I will fix it when I get back from vacation (18th August).

The issue can be sidestepped by manually tokenizing the input and not using the English tokenizer.
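As a rough illustration of that workaround, the sketch below splits the input into word and punctuation tokens in plain Python 3, so that no token boundary can fall inside a multibyte character. The `simple_tokenize` function and its regular expression are assumptions made for this example, not MorphoDiTa's tokenization rules, and the resulting tokens would still have to be handed to the tagger through the bindings instead of its built-in tokenizer.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for the English tokenizer: runs of word characters
# become one token, every other non-space character is its own token.
# The pattern works on Unicode code points, never on raw bytes.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_RE.findall(text)


sentence = u"manifestation of the people\u2019s \u2018mental enslavement\u2019"
tokens = simple_tokenize(sentence)

# Every token is a complete string, so encoding it to UTF-8 and decoding
# it again cannot hit a truncated sequence.
for token in tokens:
    assert token.encode("utf-8").decode("utf-8") == token

print(tokens)
```

This splitting is much cruder than a real English tokenizer (for example, it separates the possessive s from the quote), so tagging quality may differ from what the built-in tokenizer would give.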

foxik commented 10 years ago

Fixed by 01588cf.

A new stable version, 1.3, containing the fix has been released on GitHub, CPAN and PyPI.
