When tokenizing Japanese or Korean, MeCab's dictionaries no longer have to be installed separately as system packages. They can now be found via the Python packages `ipadic` and `mecab-ko-dic`.
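As a minimal sketch of what this enables (assuming the `mecab-python3` bindings, which are not named above): the `ipadic` package exposes a ready-made argument string pointing MeCab at the dictionary bundled inside the wheel, so no system-wide dictionary install is needed. The Korean package is expected to work analogously once released.

```python
import MeCab   # from mecab-python3 (assumed bindings)
import ipadic

# ipadic.MECAB_ARGS is a "-d <dicdir>" argument string that points MeCab
# at the dictionary shipped inside the Python package.
tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
print(tagger.parse("すもももももももものうち"))
```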
When the tokenizer had to infer word boundaries in languages without spaces, inputs that were too long (such as the letter 'l' repeated 800 times) were causing overflow errors. We changed the sequence of operations so that it no longer overflows, and such inputs simply get a frequency of 0.
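A minimal sketch of the reordering idea, not the library's actual code: `INFERRED_SPACE_FACTOR` and the penalty form here are assumptions for illustration. Applying the per-boundary penalty in log space turns what would be a float overflow (a huge penalty value computed directly) into a clean underflow to 0.0 for extreme inputs.

```python
import math

INFERRED_SPACE_FACTOR = 10.0  # hypothetical penalty per inferred boundary

def penalized_frequency(freq: float, num_tokens: int) -> float:
    """Apply the boundary penalty in log space so long inputs yield 0.

    Computing INFERRED_SPACE_FACTOR ** (num_tokens - 1) directly
    overflows for hundreds of tokens; subtracting logs never does.
    """
    if freq <= 0.0:
        return 0.0
    log_freq = math.log10(freq) - (num_tokens - 1) * math.log10(INFERRED_SPACE_FACTOR)
    # 10.0 ** log_freq underflows to 0.0 (rather than raising) once
    # log_freq drops below roughly -323, matching "a frequency of 0".
    return 10.0 ** log_freq

# 800 single-letter tokens: the direct computation would raise
# OverflowError; the log-space version simply returns 0.0.
print(penalized_frequency(1e-6, 800))
```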
The `mecab-ko-dic` Python package is not quite released yet, and its release should be considered part of this pull request: https://github.com/LuminosoInsight/mecab-ko-dic