kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Merging ARPA lms #191

Open Break-Neck opened 5 years ago

Break-Neck commented 5 years ago

Hi, could you please tell me whether I can merge a few large LMs in ARPA format using KenLM? I looked through the existing issues but couldn't find an answer.

Could you please clarify this? I haven't found any other good tool for automatically fitting log-linear merge weights on a corpus. Thank you for your project!

kpu commented 5 years ago

Interpolation was developed around the time neural networks took over, so it has rough edges. Currently the interpolation tool only accepts the intermediate format, and there is no ARPA->intermediate converter, though one could be written.
The intermediate format is relatively simple. There is a separate file for each order, containing records for the n-grams of that order. Each record is an array of 32-bit vocab ids, a 32-bit float log10 probability, and a 32-bit float log10 backoff (except that the highest order has no backoff). Files are sorted in suffix order. The unknown word must have id 0. There is also a small piece of metadata about the order, which you can see in the examples generated by lmplz. The vocabulary is a separate file containing the strings in order, null-delimited.
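To make the record layout described above more concrete, here is a minimal Python sketch of how one order's records and the vocabulary file could be serialized. It is not part of KenLM, and several details are assumptions rather than facts from this thread: little-endian byte order, a null byte after every vocabulary string, ids assigned in first-seen order (the real tools may assign ids differently), and "suffix order" read as comparing the last word first, then the one before it, and so on. All function names are hypothetical. File naming and the metadata about order are not covered; compare against the files lmplz generates before relying on any of this.

```python
import struct


def build_vocab(words):
    """Map words to integer ids, reserving id 0 for the unknown word.

    Assumption: ids after <unk> follow first-seen order; the real tools
    may use a different assignment.
    """
    vocab = {"<unk>": 0}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def vocab_bytes(vocab):
    """Serialize the vocabulary strings in id order, each followed by a null byte."""
    ordered = sorted(vocab, key=vocab.get)
    return b"".join(word.encode("utf-8") + b"\0" for word in ordered)


def pack_record(ids, log10_prob, log10_backoff=None):
    """Pack one record: uint32 ids, float32 probability, optional float32 backoff.

    Assumption: little-endian layout ("<" in the struct format).
    """
    fmt = "<" + "I" * len(ids) + "f"
    values = list(ids) + [log10_prob]
    if log10_backoff is not None:
        fmt += "f"
        values.append(log10_backoff)
    return struct.pack(fmt, *values)


def suffix_key(ids):
    """Sort key for suffix order: last word id first, then earlier words."""
    return tuple(reversed(ids))


def encode_order(ngrams, vocab, is_highest):
    """Encode all n-grams of a single order.

    ngrams: iterable of (words, log10_prob, log10_backoff) tuples.
    The backoff is dropped when is_highest is True.
    """
    rows = [([vocab[w] for w in words], prob, backoff)
            for words, prob, backoff in ngrams]
    rows.sort(key=lambda row: suffix_key(row[0]))
    return b"".join(
        pack_record(ids, prob, None if is_highest else backoff)
        for ids, prob, backoff in rows
    )
```

An ARPA->intermediate converter would additionally need to parse the ARPA text into `(words, log10_prob, log10_backoff)` tuples per order (treating a missing backoff as 0.0) and write one file per order plus the metadata.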

sarahberanek commented 3 years ago

Has anybody looked into this and implemented such a tool? I am also very interested in interpolating with an existing ARPA LM.

khoanguyenvietmanh commented 3 years ago

Is there a tool you would suggest for converting a .arpa file to the intermediate format?