Open patarapolw opened 2 years ago
I recognize that ipadic is old and busted, but unfortunately, using UniDic would cause the tokenization not to match the word frequencies that were extracted from the source corpora. We would have to re-compute the frequencies.
Also unfortunately, one of the source corpora belongs to my old company and I have no access to it.
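
To make the mismatch concrete, here is a minimal sketch of why switching dictionaries breaks frequency lookups. The segmentations below are hypothetical placeholders, not real MeCab output from either dictionary, and `lookup` is an illustrative helper rather than part of any library.

```python
from collections import Counter

# Hypothetical segmentations of the same text under two dictionaries.
# These are illustrative placeholders, NOT actual output of ipadic or UniDic.
corpus_tokens_old = ["東京都", "に", "住む"]      # how the corpus was split when frequencies were counted
query_tokens_old = ["東京都", "に", "住む"]       # same dictionary at lookup time
query_tokens_new = ["東京", "都", "に", "住む"]   # a different dictionary splits more finely

# Frequency table keyed by the tokens the old dictionary produced.
freqs = Counter(corpus_tokens_old)


def lookup(tokens, table):
    """Return the stored count for each token, or 0 if the table has no entry."""
    return {tok: table.get(tok, 0) for tok in tokens}


old_hits = lookup(query_tokens_old, freqs)
new_hits = lookup(query_tokens_new, freqs)

# Tokens produced by the new segmentation that the table has never seen.
missing = [tok for tok, count in new_hits.items() if count == 0]
print("missing under new segmentation:", missing)
```

Every token from the old segmentation is found, while the finer-grained tokens have no entry at all. Fixing that means re-tokenizing the original corpora and recounting, which is not possible for a corpus that is no longer accessible.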
`mecab-python3` itself doesn't recommend `ipadic` anymore. Furthermore, other tokenizers might also be considered (though that is a little out of scope, and could perhaps create more confusion).