Closed BLKSerene closed 2 years ago
Just a note: Moses tokenizer has the same behavior:
$ echo "தமிழ் மொழி (Tamil language) தமிழர்களினதும், தமிழ் பேசும் பலரதும் தாய்மொழி ஆகும்." | tokenizer.perl -q -no-escape -l ta
தமிழ ் மொழி ( Tamil language ) தமிழர ் களினதும ் , தமிழ ் பேசும ் பலரதும ் தாய ் மொழி ஆகும ் .
Yes, actually this is the same as the Hindi problem in #42.
There's a way to resolve this, but it requires a little more digging and a better understanding of how Indian languages are encoded in Unicode =(
This is caused by https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L41 padding spaces around every character that isn't alphanumeric according to https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L24
Adding the ் character (U+0BCD) to the IsAlnum list and uncommenting it will fix this issue partially, but there are actually more problematic characters than just that one.
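To see why ் ends up outside the alphanumeric set, it can help to look at the Unicode properties of each code point in a Tamil word. The snippet below is only an illustrative sketch using Python's standard `unicodedata` module; the `describe` helper is a hypothetical name, and the general categories it prints are not the same thing as the IsAlnum list that sacremoses uses.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )

# The first word of the Tamil example, written with explicit escapes:
# TA + MA + VOWEL SIGN I + LLLA + VIRAMA
WORD = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

for ch in WORD:
    # Letters are category "Lo"; the trailing sign is a nonspacing mark ("Mn")
    print(*describe(ch))
```

The last code point is a combining mark rather than a letter, so any rule that only keeps letters and digits together will detach it from the word it belongs to.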
And the problem also exists for Russian:
>>> import sacremoses
>>> TEXT_RUS = 'Ру́сский язы́к ([ˈruskʲɪi̯ jɪˈzɨk] Информация о файле слушать)[~ 3][⇨] — один из восточнославянских языков, национальный язык русского народа.'
>>> EXPECTED_TOKENS = ['Ру́сский', 'язы́к', '(', '[', 'ˈruskʲɪi̯', 'jɪˈzɨk', ']', 'Информация', 'о', 'файле', 'слушать', ')', '[', '~', '3', ']', '[', '⇨', ']', '—', 'один', 'из', 'восточнославянских', 'языков', ',', 'национальный', 'язык', 'русского', 'народа', '.']
>>> t = sacremoses.MosesTokenizer(lang = 'ru')
>>> t.tokenize(TEXT_RUS)
['Ру', '́', 'сский', 'язы', '́', 'к', '(', '[', 'ˈruskʲɪi', '̯', 'jɪˈzɨk', ']', 'Информация', 'о', 'файле', 'слушать', ')', '[', '~', '3', ']', '[', '⇨', ']', '—', 'один', 'из', 'восточнославянских', 'языков', ',', 'национальный', 'язык', 'русского', 'народа', '.']
>>> assert t.tokenize(TEXT_RUS) == EXPECTED_TOKENS
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    assert t.tokenize(TEXT_RUS) == EXPECTED_TOKENS
AssertionError
Same output from the default mosesdecoder (commit: https://github.com/moses-smt/mosesdecoder/commit/05788925812f0d3265e355565cbb1701a0ad7510):
$ echo "Ру́сский язы́к ([ˈruskʲɪi̯ jɪˈzɨk] Информация о файле слушать)[~ 3][⇨] — один из восточнославянских языков, национальный язык русского народа." | perl tokenizer.perl -l ru
Tokenizer Version 1.1
Language: ru
Number of threads: 1
Ру ́ сский язы ́ к ( [ ˈruskʲɪi ̯ jɪˈzɨk ] Информация о файле слушать ) [ ~ 3 ] [ ⇨ ] — один из восточнославянских языков , национальный язык русского народа .
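In both the Tamil and the Russian output, the stray tokens are combining marks that were padded with spaces. As a rough sketch of one possible direction (not the actual sacremoses or Moses rule), the padding step could treat any code point whose Unicode general category starts with L, M, or N as word-internal. The function names below are hypothetical, and this toy tokenizer ignores everything else Moses does (escaping, nonbreaking prefixes, and so on).

```python
import unicodedata

def is_word_char(ch):
    """Assumption for this sketch: letters (L*), marks (M*) and numbers (N*)
    all count as word-internal characters."""
    return unicodedata.category(ch)[0] in "LMN"

def pad_non_word_chars(text):
    """Surround every non-word, non-space character with spaces,
    then collapse runs of whitespace."""
    padded = []
    for ch in text:
        if ch.isspace() or is_word_char(ch):
            padded.append(ch)
        else:
            padded.append(f" {ch} ")
    return " ".join("".join(padded).split())

def naive_tokenize(text):
    """Toy tokenizer: pad, then split on whitespace."""
    return pad_non_word_chars(text).split()

# U+0301 is the combining acute accent used as a stress mark in the Russian example.
print(naive_tokenize("\u0420\u0443\u0301\u0441\u0441\u043a\u0438\u0439 (Tamil language)"))
```

With this rule the stress mark stays attached to its word while the parentheses are still split off. Whether treating every mark as word-internal is safe for all languages is a separate question that this sketch does not answer.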
Hi, the results for Tamil tokenization are weird: