Tokenization in Korean, plus abjad languages

rspeer / wordfreq

Access a database of word frequencies, in various natural languages.


Tokenization in Korean, plus abjad languages #38

Closed. rspeer closed this 8 years ago

rspeer commented 8 years ago

This branch adds support for Korean tokenization via MeCab. It now includes Korean and Japanese MeCab data files as subdirectories of wordfreq/data, instead of assuming the Japanese data is installed system-wide.
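
As a rough illustration of pointing MeCab at bundled dictionaries rather than a system-wide install, here is a minimal sketch. The directory names in `DICT_DIRS` and the helper names are hypothetical and are not taken from this branch; the only real API assumed is `MeCab.Tagger` from the `mecab-python3` package, which accepts a `-d <dictionary dir>` argument string.

```python
from pathlib import Path

# Hypothetical subdirectory names; the actual names under wordfreq/data may differ.
DICT_DIRS = {"ja": "mecab-ja-ipadic", "ko": "mecab-ko-dic"}


def mecab_dict_path(data_root, lang):
    """Return the directory of the bundled MeCab dictionary for `lang`."""
    if lang not in DICT_DIRS:
        raise ValueError(f"no bundled MeCab dictionary for {lang!r}")
    return Path(data_root) / DICT_DIRS[lang]


def mecab_args(data_root, lang):
    """Build the argument string that selects the bundled dictionary."""
    # Note: this simple form breaks if the path contains spaces.
    return f"-d {mecab_dict_path(data_root, lang)}"


def make_tagger(data_root, lang):
    """Create a tagger for `lang`; requires mecab-python3 to be installed."""
    import MeCab  # imported lazily so the path helpers work without MeCab

    return MeCab.Tagger(mecab_args(data_root, lang))
```

Keeping the path logic separate from tagger creation means the dictionary lookup can be checked even on a machine where MeCab itself is not installed.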

Separately, it also supports tokenizing other languages written in abjad scripts, such as Hebrew and Persian, though there is no frequency data for these languages yet.
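
One normalization that is often applied to abjad scripts before counting words is dropping optional vowel marks, so that pointed and unpointed spellings of a word map to the same token. The sketch below is an assumption about what such a step could look like, not a description of this branch; `strip_abjad_marks` is a hypothetical name, and it uses only the standard `unicodedata` module.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character; purely decorative


def strip_abjad_marks(text):
    """Drop nonspacing marks (Unicode category Mn) and tatweels from `text`.

    The text is deliberately not decomposed first, so precomposed letters
    such as U+0622 (alef with madda) are left intact.
    """
    return "".join(
        ch
        for ch in text
        if ch != TATWEEL and unicodedata.category(ch) != "Mn"
    )


# Hebrew "shalom" written with niqqud (shin dot, qamats, holam).
pointed_hebrew = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
# Arabic "kitab" written with a kasra and a fatha.
pointed_arabic = "\u0643\u0650\u062a\u064e\u0627\u0628"

print(strip_abjad_marks(pointed_hebrew))
print(strip_abjad_marks(pointed_arabic))
```

Filtering on the `Mn` category is a blunt tool: it removes every nonspacing mark, which suits frequency counting but would be too lossy for text that must be displayed with its original pointing.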

rspeer commented 8 years ago

Hm, I accidentally included an unrelated change to wordfreq_builder. Let me untangle that.