rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

How to use Unidic for Japanese? #99

Open patarapolw opened 2 years ago

patarapolw commented 2 years ago

mecab-python3 itself doesn't recommend ipadic anymore.

> In order to use MeCab, you must install a dictionary. There are many different dictionaries available for MeCab. These UniDic packages, which include slight modifications for ease of use, are recommended:
>
> - [unidic](https://github.com/polm/unidic-py): The latest full UniDic.
> - [unidic-lite](https://github.com/polm/unidic-lite): A slightly modified UniDic 2.1.2, chosen for its small size.
>
> The dictionaries below are not recommended due to being unmaintained for many years, but they are available for use with legacy applications.
>
> - [ipadic](https://github.com/polm/ipadic-py)
> - [jumandic](https://github.com/polm/jumandic-py)
>
> For more details on the differences between dictionaries see [here](https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html).

Other tokenizers might also be worth considering, though that is a little out of scope and could create more confusion.

rspeer commented 2 years ago

I recognize that ipadic is old and busted, but unfortunately, using Unidic would cause the tokenization not to match the word frequencies that were extracted from the source corpora. We would have to re-compute the frequencies.

Also unfortunately, one of the source corpora belongs to my old company and I have no access to it.
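
To illustrate the mismatch described above, here is a minimal, self-contained sketch. It does not use wordfreq or MeCab: the two "tokenizers" are hypothetical lookup tables with invented segmentations, and the corpus is toy data. It only shows the general effect: a frequency table built with one tokenization returns zero for tokens that a different tokenization produces.

```python
from collections import Counter

# Hypothetical segmentations, invented for illustration only.
# They stand in for two dictionaries that split the same text differently.
OLD_SEGMENTS = {"newyorkcity": ["newyork", "city"], "city": ["city"]}
NEW_SEGMENTS = {"newyorkcity": ["new", "york", "city"], "city": ["city"]}


def tokenize_old(text):
    """Toy stand-in for the tokenizer used when the frequencies were built."""
    return OLD_SEGMENTS[text]


def tokenize_new(text):
    """Toy stand-in for a replacement tokenizer with different boundaries."""
    return NEW_SEGMENTS[text]


def build_frequencies(corpus, tokenizer):
    """Count tokens over the corpus and normalize to relative frequencies."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenizer(text))
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


def lookup(frequencies, tokens):
    """Return the stored frequency of each token, or 0.0 if it was never seen."""
    return [frequencies.get(token, 0.0) for token in tokens]


# Toy corpus; the frequency table is built once with the old tokenizer.
corpus = ["newyorkcity", "newyorkcity", "city"]
frequencies = build_frequencies(corpus, tokenize_old)

# Looking up with matching tokenization finds every token.
matched = lookup(frequencies, tokenize_old("newyorkcity"))

# Looking up with the new tokenization misses the tokens it split differently.
mismatched = lookup(frequencies, tokenize_new("newyorkcity"))

print(matched)
print(mismatched)
```

In this toy setup, the only way to make the second lookup meaningful is to rebuild the table by running `build_frequencies` with `tokenize_new` over the original corpus, which requires access to that corpus.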