Closed by itniemin 5 years ago
Looks like a bug in the C code rather than anything Python-specific:
$ python3 -c 'print(b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98".decode("utf-8"))'|./stemwords -l danish|xxd -g1
00000000: f0 9f 98 98 61 61 f0 9f 98 0a ....aa....
And in fact the pure Python versions of the stemmers handle this correctly:
$ python3 -c 'print(b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98".decode("utf-8"))'|python ./stemwords.py -l danish -i /dev/stdin -o /dev/stdout|xxd -g1
00000000: f0 9f 98 98 61 61 f0 9f 98 98 0a ....aa.....
The UTF-8 is mangled here because the C runtime support currently handles only UTF-8 sequences of up to 3 bytes (the emoji in your example is 4 bytes), which leaves the cursor part way through a UTF-8 sequence. That's easily fixed by extending the code to handle 4-byte sequences (the longest valid sequences under RFC 3629).
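To illustrate the cursor problem, here is a minimal Python sketch (not the Snowball C runtime; the helper names `utf8_seq_len` and `char_boundaries` and the `max_len` cap are hypothetical) of stepping through a byte buffer one character at a time. Capping sequences at 3 bytes is an assumed, simplified model of the limitation described above.

```python
def utf8_seq_len(lead, max_len=4):
    """Sequence length implied by a UTF-8 lead byte, capped at max_len.

    Hypothetical helper: max_len=3 models a decoder that does not know
    about 4-byte sequences.
    """
    if lead >= 0xF0:
        n = 4
    elif lead >= 0xE0:
        n = 3
    elif lead >= 0xC0:
        n = 2
    else:
        # ASCII, or a stray continuation byte treated as a single unit
        n = 1
    return min(n, max_len)


def char_boundaries(data, max_len=4):
    """Byte offsets at which the cursor stops while stepping forward."""
    offsets = []
    pos = 0
    while pos < len(data):
        offsets.append(pos)
        pos += utf8_seq_len(data[pos], max_len)
    return offsets


data = b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98"
print(char_boundaries(data, max_len=3))  # stops at 3 and 9, inside the emoji
print(char_boundaries(data, max_len=4))  # stops only at character starts
```

With the 3-byte cap the cursor stops at offsets 3 and 9, each of which is the last byte of an emoji, so that byte can be mistaken for a character in its own right.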
There's a side issue: the test for "non-vowel" currently accepts any character which isn't a vowel, and it would make more sense to restrict it to letters valid in Danish. That contributes here because the final two bytes of this emoji's UTF-8 encoding happen to be identical (98 98), so one is deleted because the stemmer thinks it is "undoubling a non-vowel". This is already noted as something to address at https://github.com/snowballstem/snowball/issues/81.
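The effect of that deletion can be reproduced with a toy byte-level model. The function `undouble_last_byte` below is hypothetical and only mimics the outcome, not the Danish stemmer's actual rule: it drops the final byte when the last two bytes are equal.

```python
def undouble_last_byte(word):
    """Toy model: drop the final byte if it repeats the one before it."""
    if len(word) >= 2 and word[-1] == word[-2]:
        return word[:-1]
    return word


word = b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98"
stemmed = undouble_last_byte(word)
print(stemmed.hex(" "))  # same bytes as the xxd output, minus the newline

try:
    stemmed.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # The trailing f0 9f 98 is a truncated 4-byte sequence
    decode_failed = True
print(decode_failed)
```

The result ends in a truncated 4-byte sequence, so decoding it as UTF-8 fails, which is consistent with the UnicodeDecodeError reported from PyStemmer.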
PyStemmer throws a UnicodeDecodeError on specific input involving certain emojis with the Danish stemmer (note the difference in inputs: the second has two 'a' characters between the emojis):
Other tested languages work: