gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Using KenLM instead of SRILM? #51

Closed pguyot closed 5 years ago

pguyot commented 5 years ago

KenLM is apparently used to adapt Kaldi models in the project https://github.com/gooofy/kaldi-adapt-lm, yet SRILM is used in speech_build_lm.py. What is the rationale? Couldn't KenLM be used in the speech_build_lm.py phase instead?

gooofy commented 5 years ago

Good point! Actually, replacing SRILM with KenLM is pretty high on my todo list, and there is no real technical reason why it couldn't be done. Apart from the usual "never change a running system", I was mainly worried about LM size and the impact it might have on the resulting Kaldi models (i.e. the smaller rpi model might need some adjustments to meet realtime performance constraints if the LM turns out to be too large). Therefore I have (once again) postponed the switch for the current round of model runs (a new German model is training right now). I might try switching to KenLM for the next round.
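
One way to keep an eye on the LM size concern is to compare the n-gram counts that each toolkit writes into the header of its ARPA file. The snippet below is a minimal sketch, not part of zamia-speech: the function name `arpa_ngram_counts` is hypothetical, and it only assumes the standard ARPA layout (a `\data\` line followed by `ngram N=count` lines).

```python
import re


def arpa_ngram_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    counts = {}
    in_data = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if in_data:
            m = re.match(r"ngram (\d+)=(\d+)$", line)
            if m:
                counts[int(m.group(1))] = int(m.group(2))
            elif line:
                # First non-empty line that is not a count ends the header.
                break
    return counts
```

It could be called as `arpa_ngram_counts(open(lm_fn))` on both an SRILM-built and a KenLM-built model to see how far the pruned sizes differ.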

pguyot commented 5 years ago

SRILM simply crashed here after trying to allocate 64GB of RAM. I replaced it with KenLM with the following change, but I am mostly in the dark.

--- a/speech_build_lm.py
+++ b/speech_build_lm.py
@@ -97,6 +97,10 @@ def prune_ngram_model(ngram_path, lm_fn, lm_pruned_fn):

     os.system(cmd)

+def train_pruned_model_with_kenlm(train_fn, lm_fn):
+    cmd = 'lmplz --skip_symbols -o 4 -S 70%% --prune 0 3 5 --text %s > %s' % (train_fn, lm_fn)
+    logging.info(cmd)
+    os.system(cmd)

 init_app(PROC_TITLE)
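
A possible variant of the function in the diff above, shown only as a sketch: it passes the same lmplz flags as an argument list through `subprocess.run` instead of a shell string, so the `%` needs no escaping and a failing lmplz run raises an error instead of being silently ignored. The helper `build_lmplz_cmd` and its default values are assumptions for illustration, and `--arpa` (write the ARPA file directly) replaces the shell redirection.

```python
import logging
import subprocess


def build_lmplz_cmd(order=4, memory="70%", prune=(0, 3, 5)):
    """Return the lmplz argument list, using the same flags as the diff."""
    cmd = ["lmplz", "--skip_symbols", "-o", str(order), "-S", memory]
    if prune:
        cmd += ["--prune"] + [str(p) for p in prune]
    return cmd


def train_pruned_model_with_kenlm(train_fn, lm_fn):
    """Train a pruned ARPA model; raises CalledProcessError if lmplz fails."""
    cmd = build_lmplz_cmd() + ["--text", train_fn, "--arpa", lm_fn]
    logging.info(" ".join(cmd))
    subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact invocation without having KenLM installed.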

gooofy commented 5 years ago

Interesting. I never noticed any instabilities with SRILM. Licensing, on the other hand, is a bit of an issue; that's why I am considering KenLM.

pguyot commented 5 years ago

Fixed with commit e2507eb