veer66 / wordcutpy

A simple word breaker written in Python

bug in tokenize #6

Open wannaphong opened 6 years ago

wannaphong commented 6 years ago

Try the following. The first word is split after its initial consonant:

>>> from wordcut import Wordcut
>>> wordcut = Wordcut.bigthai()
>>> print(wordcut.tokenize("จุ๋มสบายดีไหม"))
['จ', 'ุ๋ม', 'สบาย', 'ดี', 'ไหม']

pepa65 commented 4 years ago

Many (most?) word cutters have this 'bug': if a syllable is typed in the order consonant, tone mark, above/below vowel, the word is not recognized, because the word lists only contain the proper order (consonant, above/below vowel, tone mark).
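
One possible workaround is to reorder the marks before tokenizing. The sketch below is only an illustration of that idea: `normalize_mark_order` is a hypothetical helper, not part of wordcutpy, and it assumes that only the four tone marks (U+0E48 to U+0E4B) and the above/below vowels U+0E31 and U+0E34 to U+0E39 need to be swapped.

```python
import re

# Tone marks MAI EK .. MAI CHATTAWA (U+0E48-U+0E4B).
TONE_MARKS = "\u0E48-\u0E4B"
# Above/below vowels: MAI HAN-AKAT (U+0E31) and SARA I .. SARA UU (U+0E34-U+0E39).
# Assumption: these are the only vowels that need reordering.
ABOVE_BELOW_VOWELS = "\u0E31\u0E34-\u0E39"

# Matches a tone mark immediately followed by an above/below vowel.
_SWAPPED = re.compile(f"([{TONE_MARKS}])([{ABOVE_BELOW_VOWELS}])")


def normalize_mark_order(text: str) -> str:
    """Rewrite 'tone mark, vowel' pairs as 'vowel, tone mark'.

    Hypothetical helper for illustration; text already in the
    expected order is returned unchanged.
    """
    return _SWAPPED.sub(r"\2\1", text)
```

It could then be applied before tokenizing, for example `wordcut.tokenize(normalize_mark_order(text))`. Whether this resolves the input reported above depends on the actual code point order in that string, which has not been verified here.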