Open jg1uaa opened 1 year ago
sortlm.py requires the following rules:
If a rule is not met, sortlm.py stops with an error.
The data.arpa in libkkc-data is tuned to meet these conditions, but ARPA files generated normally by LM utilities (such as IRSTLM) are not.
So unknown words and word pairs should be discarded.
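
A minimal sketch of such a discarding step, as a pre-filter run on the ARPA file before sortlm.py. It is not sortlm.py's actual check: it assumes the rule is that the leading and trailing (n-1)-word sequences of every n-gram must already be present in the lower-order section, and the function name `prune_arpa` is hypothetical. It also rewrites the `ngram N=...` counts in the header so they match what is kept.

```python
import re

def prune_arpa(lines):
    """Drop n-grams whose (n-1)-word prefix or suffix is missing from the
    lower-order section (an assumed rule, not taken from sortlm.py)."""
    kept = {}    # order -> set of word tuples kept in that section
    out = []
    order = 0    # 0 means we are outside any \N-grams: section
    for line in lines:
        line = line.rstrip("\n")
        stripped = line.strip()
        section = re.fullmatch(r"\\(\d+)-grams:", stripped)
        if section:
            order = int(section.group(1))
            kept[order] = set()
            out.append(line)
            continue
        if stripped == "\\end\\":
            order = 0
        if order == 0 or not stripped:
            out.append(line)          # header, blank line, or end marker
            continue
        tokens = line.split()
        words = tuple(tokens[1:1 + order])   # tokens[0] is the log probability
        if order > 1:
            lower = kept.get(order - 1, set())
            if words[:-1] not in lower or words[1:] not in lower:
                continue              # unknown word or word pair: discard
        kept[order].add(words)
        out.append(line)

    # Rewrite the header counts to match the entries that were kept.
    fixed = []
    for line in out:
        header = re.fullmatch(r"ngram (\d+)=\d+", line.strip())
        if header:
            n = int(header.group(1))
            fixed.append(f"ngram {n}={len(kept.get(n, ()))}")
        else:
            fixed.append(line)
    return fixed
```

Usage would be along the lines of `open("out.arpa", "w").write("\n".join(prune_arpa(open("in.arpa"))) + "\n")`, where both file names are placeholders.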
I am trying to create data.arpa from the Nihongo Web Corpus 2010; see https://github.com/jg1uaa/nwc2010-libkkc