Closed by pguyot 5 years ago
Good point! Actually, replacing SRILM with KenLM is pretty high on my todo list, and there is no real technical reason why it couldn't be done. Apart from the usual "never change a running system", I was mainly worried about LM size and the impact it might have on the resulting Kaldi models (i.e. the smaller RPi model might need some adjustments to meet realtime performance constraints if the LM turns out to be too large). Therefore I have (once again) postponed the switch for the current round of model runs (a new German model is training right now). I might try switching to KenLM for the next round.
SRILM simply crashed here after trying to allocate 64GB of RAM. I replaced it with KenLM using the following change, but I am mostly in the dark:
--- a/speech_build_lm.py
+++ b/speech_build_lm.py
@@ -97,6 +97,10 @@ def prune_ngram_model(ngram_path, lm_fn, lm_pruned_fn):
     os.system(cmd)
+def train_pruned_model_with_kenlm(train_fn, lm_fn):
+    cmd = 'lmplz --skip_symbols -o 4 -S 70%% --prune 0 3 5 --text %s > %s' % (train_fn, lm_fn)
+    logging.info(cmd)
+    os.system(cmd)
 init_app(PROC_TITLE)
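For what it's worth, the same lmplz call could probably be made without going through the shell, so that file names containing spaces or quotes don't break the command and a failing lmplz raises an error instead of being silently ignored. This is only a rough sketch, not what the repo does: it assumes Python 3.5+ (for `subprocess.run`) and that `lmplz` is on the PATH, and `build_lmplz_cmd` is a made-up helper name. The flags are the ones from the patch above.

```python
import logging
import subprocess


def build_lmplz_cmd(train_fn, order=4, memory="70%", prune=(0, 3, 5)):
    """Build the lmplz argument list (hypothetical helper, not in the repo).

    Mirrors the flags from the patch: --skip_symbols, -o, -S, --prune, --text.
    """
    return (
        ["lmplz", "--skip_symbols", "-o", str(order), "-S", memory, "--prune"]
        + [str(threshold) for threshold in prune]
        + ["--text", train_fn]
    )


def train_pruned_model_with_kenlm(train_fn, lm_fn):
    """Run lmplz and write its ARPA output (stdout) to lm_fn."""
    cmd = build_lmplz_cmd(train_fn)
    logging.info("%s > %s", " ".join(cmd), lm_fn)
    with open(lm_fn, "wb") as arpa_out:
        # check=True raises CalledProcessError if lmplz exits non-zero,
        # unlike os.system, whose return value the patch ignores.
        subprocess.run(cmd, stdout=arpa_out, check=True)
```

Passing a list rather than a format string also means the `%%` escaping for `-S 70%` is no longer needed.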
Interesting. I never noticed any instabilities with SRILM. Licensing is a bit of an issue on the other hand, which is why I am considering KenLM.
Fixed with commit e2507eb
KenLM is apparently used to adapt Kaldi models in the project https://github.com/gooofy/kaldi-adapt-lm, yet SRILM is used in speech_build_lm.py. What is the rationale? Cannot KenLM be used in the speech_build_lm.py phase instead?