This branch adds support for Korean tokenization via MeCab. It now includes Korean and Japanese MeCab data files as subdirectories of wordfreq/data, instead of assuming the Japanese data is installed system-wide.
Separately, it also supports tokenizing other languages written in abjad scripts, such as Hebrew and Persian, though there is no frequency data for these languages yet.
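
As a rough illustration of the bundled-data approach, the sketch below resolves a per-language MeCab dictionary directory relative to a package data directory rather than a system-wide install. The function name `mecab_dict_path` and the subdirectory names are hypothetical and may not match what this branch actually uses.

```python
from pathlib import Path

# Hypothetical subdirectory names; the branch's actual layout may differ.
MECAB_DICT_DIRS = {"ja": "mecab-ja", "ko": "mecab-ko"}


def mecab_dict_path(data_root, lang):
    """Return the bundled dictionary directory for `lang` under `data_root`."""
    try:
        subdir = MECAB_DICT_DIRS[lang]
    except KeyError:
        # Languages without a bundled dictionary are rejected explicitly
        # instead of silently falling back to a system-wide install.
        raise ValueError(f"no bundled MeCab dictionary for {lang!r}") from None
    return Path(data_root) / subdir
```

For example, `mecab_dict_path("wordfreq/data", "ko")` points at a subdirectory of `wordfreq/data`, so the lookup does not depend on where MeCab data is installed on the host.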