ueno / libkkc

Japanese Kana Kanji conversion input method library
GNU General Public License v3.0

Improvement to accept generic ARPA language-model data #46

Open jg1uaa opened 1 year ago

jg1uaa commented 1 year ago

sortlm.py requires the following rules:

If these rules are not met, sortlm.py stops with an error.

data.arpa in libkkc-data is tuned to meet these conditions, but ARPA files generated in the usual way by LM utilities (such as IRSTLM) are not.

So unknown words and word pairs should be discarded rather than causing an error.
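
As a rough illustration of the kind of filtering meant here, the sketch below drops higher-order entries from an ARPA file before it is handed to sortlm.py. It is not the code in sortlm.py, and the exact rules are not reproduced in this text, so it rests on two assumptions: an "unknown word" is one that has no entry in the 1-gram section, and an n-gram is also dropped when its leading (n-1)-word context was not kept. The function name `filter_arpa` is hypothetical.

```python
import re


def filter_arpa(lines):
    """Return ARPA lines with unusable n-grams (n >= 2) removed.

    Assumptions (not taken from sortlm.py itself):
      * a word is "unknown" if it has no entry in the 1-gram section;
      * an n-gram is dropped if its first n-1 words are not a kept (n-1)-gram;
      * sections appear in ascending order, as in standard ARPA files.
    """
    order = 0      # 0 while still inside the \data\ header
    kept = {}      # order -> list of kept entry lines
    seen = {}      # order -> set of kept word tuples
    for raw in lines:
        line = raw.strip()
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            kept[order] = []
            seen[order] = set()
            continue
        if line == "\\end\\":
            break
        if order == 0 or not line:
            continue  # header counts are recomputed below
        fields = line.split()
        words = tuple(fields[1:1 + order])  # fields[0] is the log probability
        if len(words) < order:
            continue  # malformed entry
        if order > 1:
            vocabulary = seen.get(1, set())
            if any((w,) not in vocabulary for w in words):
                continue  # contains an unknown word
            if words[:-1] not in seen.get(order - 1, set()):
                continue  # context was not kept at the lower order
        kept[order].append(line)
        seen[order].add(words)

    # Rebuild the file with counts that match the surviving entries.
    out = ["\\data\\"]
    out.extend(f"ngram {n}={len(kept[n])}" for n in sorted(kept))
    out.append("")
    for n in sorted(kept):
        out.append(f"\\{n}-grams:")
        out.extend(kept[n])
        out.append("")
    out.append("\\end\\")
    return out
```

The header counts are recomputed because tools that read ARPA files generally expect the `ngram N=` lines to match the number of entries in each section.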

jg1uaa commented 1 year ago

I am trying to create data.arpa from the Nihongo Web Corpus 2010 at https://github.com/jg1uaa/nwc2010-libkkc .