snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

UnicodeDecodeError with Danish stemmer #89

Closed itniemin closed 5 years ago

itniemin commented 6 years ago

PyStemmer throws a UnicodeDecodeError with the Danish stemmer on specific input involving certain emoji. Note the difference between the two inputs: the first has one 'a' between the emoji, the second has two:

> mkvirtualenv -p /usr/bin/python3 stemmer
...
> pip install PyStemmer
...
Successfully installed PyStemmer-1.3.0
> python3 --version
Python 3.6.6
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98a\xf0\x9f\x98\x98'.decode('utf-8')))"
😘a😘
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')))"
Traceback (most recent call last):
  File "Stemmer.pyx", line 184, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1669)
KeyError: b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Stemmer.pyx", line 192, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1772)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 6-8: unexpected end of data

Other tested languages work:

> python3 -c "import Stemmer; [Stemmer.Stemmer(lang).stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')) for lang in ('en', 'sv', 'fi', 'fr', 'de')]" && echo 'ok'
ok

ojwb commented 5 years ago

Looks like a bug in the C code rather than anything Python-specific:

$ python3 -c 'print(b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98".decode("utf-8"))'|./stemwords -l danish|xxd -g1
00000000: f0 9f 98 98 61 61 f0 9f 98 0a                    ....aa....

And in fact the pure Python versions of the stemmers handle this correctly:

$ python3 -c 'print(b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98".decode("utf-8"))'|python ./stemwords.py -l danish -i /dev/stdin -o /dev/stdout|xxd -g1
00000000: f0 9f 98 98 61 61 f0 9f 98 98 0a                 ....aa.....
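
To tie the two hex dumps back to the original traceback, here is a small sketch that decodes both byte strings in Python. The bytes are copied from the xxd output above with the trailing newline dropped; the helper name `try_decode` is just an illustrative choice.

```python
# Bytes copied from the two xxd dumps above, without the trailing 0x0a.
c_output = bytes.fromhex("f09f98986161f09f98")     # C stemwords
py_output = bytes.fromhex("f09f98986161f09f9898")  # pure Python stemwords.py


def try_decode(data):
    """Return the decoded string, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


for name, data in (("python", py_output), ("c", c_output)):
    result = try_decode(data)
    if isinstance(result, UnicodeDecodeError):
        # The C output stops one byte short of a complete 4-byte sequence.
        print(name, "-> invalid UTF-8:", result.reason, "at byte", result.start)
    else:
        print(name, "-> valid UTF-8,", len(result), "code points")
```

The C output is one byte short of the second emoji, which is the same truncated sequence PyStemmer fails to decode.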

ojwb commented 5 years ago

The mangling of UTF-8 here is due to the C runtime support currently only handling UTF-8 sequences of up to 3 bytes (the emoji in your example is 4 bytes), which leads to the cursor ending up partway through a UTF-8 sequence. That's easily fixed by extending the code to handle 4-byte sequences (which are the longest valid sequences under RFC 3629).
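
As a rough illustration of how a 3-byte limit leaves the cursor inside a sequence, here is a simplified Python model of stepping a cursor back by one character. This is not the Snowball C runtime code; the function `prev_char_start` and its `max_len` parameter are hypothetical and only model the idea of a cap on sequence length.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def prev_char_start(data, cursor, max_len=4):
    """Step back one character from `cursor` and return the new position.

    Walks back over continuation bytes, but never more than `max_len`
    bytes in total. `max_len=3` models a runtime that assumes no
    sequence is longer than 3 bytes (an assumption for illustration).
    """
    pos = cursor - 1
    consumed = 1
    while pos > 0 and consumed < max_len and is_continuation(data[pos]):
        pos -= 1
        consumed += 1
    return pos


emoji = b"\xf0\x9f\x98\x98"  # the 4-byte emoji from the report

# With a 3-byte cap the cursor stops on 0x9f, a continuation byte.
capped = prev_char_start(emoji, len(emoji), max_len=3)
print(capped, is_continuation(emoji[capped]))

# With 4-byte support the cursor reaches the lead byte 0xf0.
full = prev_char_start(emoji, len(emoji), max_len=4)
print(full, is_continuation(emoji[full]))
```

Once the cursor sits on a continuation byte, any later deletion or slice boundary computed from it can cut a character in half.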

There's a side issue: the test for "non-vowel" currently accepts any character which isn't a vowel, and it would make more sense to restrict it to letters valid in Danish. That contributes here because the UTF-8 encoding of this emoji happens to end in two identical bytes, and one is deleted because the stemmer thinks it is "undoubling a non-vowel". This is already noted as something to address at https://github.com/snowballstem/snowball/issues/81.
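
The difference between the two "non-vowel" tests can be sketched in Python as follows. The vowel set matches the Danish stemmer's vowels; the consonant set and both function names are assumptions made for illustration, not taken from the Snowball sources.

```python
DANISH_VOWELS = set("aeiouyæåø")
# Assumption: an illustrative set of consonant letters, chosen for this sketch.
DANISH_CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def is_non_vowel_permissive(ch):
    """Models the current test: anything that is not a vowel qualifies."""
    return ch not in DANISH_VOWELS


def is_non_vowel_restricted(ch):
    """Models the proposed test: only known consonant letters qualify."""
    return ch in DANISH_CONSONANTS


emoji = "\U0001F618"  # the emoji from the report

print(is_non_vowel_permissive(emoji))  # emoji is treated as a "non-vowel"
print(is_non_vowel_restricted(emoji))  # emoji is rejected
print(emoji.encode("utf-8")[-2:])      # the final two bytes are identical
```

With the restricted test the undoubling step would never consider the emoji at all, regardless of how its bytes happen to repeat.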