kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Interpolation of LM models in kenlm #224

Closed axchanda closed 5 years ago

axchanda commented 5 years ago

Hi, I am not sure if this is an issue or something that already exists, but I want to ask how to improve an existing model with a small amount of text. For example, I am using DeepSpeech, and its pre-trained model ships with a lm.binary. How can I add some text sentences to its LM so that it retains its own LM and also incorporates the added text? Please respond at the earliest. Thanks!

kpu commented 5 years ago

Hi! You can always interpolate probabilities from two separate models yourself.
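A minimal sketch of doing that at query time with the Python bindings, assuming both models use the same whitespace tokenization; the file names and the 0.5 weight are placeholders, and the weight should really be tuned on held-out text:

```python
import math

import kenlm  # Python bindings for this repo

# Placeholder file names; substitute your own binary or ARPA models.
BIG_LM = 'deepspeech_lm.binary'
SMALL_LM = 'my_domain_lm.binary'
LAMBDA = 0.5  # mixture weight; tune on held-out text in practice

big = kenlm.Model(BIG_LM)
small = kenlm.Model(SMALL_LM)

def mixture_logprob(sentence):
    """log10 probability of `sentence` under the linear mixture
    LAMBDA * p_big + (1 - LAMBDA) * p_small, combined word by word.

    full_scores() yields one (log10 prob, ngram length, oov) tuple
    per token plus </s>, so the two iterators stay aligned as long
    as both models see the same tokenization.
    """
    total = 0.0
    for (lp_big, _, _), (lp_small, _, _) in zip(
            big.full_scores(sentence), small.full_scores(sentence)):
        # Mix in probability space, then go back to log10.
        p = LAMBDA * 10.0 ** lp_big + (1.0 - LAMBDA) * 10.0 ** lp_small
        total += math.log10(p)
    return total

print(mixture_logprob('the quick brown fox'))
```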

If you'd like to bake them into one model, then the interpolation tool (which requires Eigen when compiling) can do that. However, it needs the actual ngrams. I'm not sure if your lm.binary is in the probing or trie format. The probing format hashes ngrams and is therefore not (efficiently) invertible to ngrams, so you would want to get the original ARPA file, which I think Mozilla publishes these days. The trie format is convertible back to ARPA. Given that probing is the default, I think you probably have probing.

Why not build a model yourself starting from raw text? Here's some data: http://www.statmt.org/wmt19/translation-task.html http://data.statmt.org/ngrams/deduped/
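As a sketch, estimating a 5-gram model from raw text and compiling it to binary can be driven like this; the paths are placeholders, lmplz reads the corpus on stdin and writes an ARPA file on stdout, and build_binary produces the default probing format:

```python
import subprocess

# Placeholder path: point this at your compiled kenlm build directory.
KENLM_BIN = '/path/to/kenlm/build/bin'

# Estimate a 5-gram model with modified Kneser-Ney smoothing (lmplz).
with open('corpus.txt') as fin, open('model.arpa', 'w') as fout:
    subprocess.run([f'{KENLM_BIN}/lmplz', '-o', '5'],
                   stdin=fin, stdout=fout, check=True)

# Compile the ARPA file into a binary for fast loading (build_binary).
subprocess.run([f'{KENLM_BIN}/build_binary', 'model.arpa', 'model.binary'],
               check=True)
```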

axchanda commented 5 years ago

Hi @kpu I tried to build a small LM, just to recognise some key words or key sentences. My text file had only 5-6 lines. When I tried to create the ARPA file, this error appeared: [screenshot of the error output]

Any ideas on how to make DeepSpeech accurate for certain words? If I only need to recognise a few OOV words or key phrases, do I still need a huge corpus of text to append and train on? Please clarify. Thanks!

axchanda commented 5 years ago

Any updates on this, please?

kpu commented 5 years ago

This is a feature, not a bug. Your training data is too small to be useful.