kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

KenLM with unicode hack (decreased LM size) #60

Closed · averkij closed this issue 2 years ago

averkij commented 2 years ago

👋 Hello! Maybe this will be useful for the community. It's about how to decrease your model size severalfold.

🔨 I've worked with different CTC decoders and used the one from the NeMo repo. Specifically for it, I trained a KenLM with these scripts. NeMo's ctc_decoder doesn't support BPE tokenization, so they developed a trick.
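As I understand the trick, the corpus preparation looks roughly like this (a minimal sketch, not NeMo's exact code; the tokenizer model, file paths, and the offset of 100 are my assumptions): every BPE token id gets encoded as a single unicode character, so KenLM sees one "word" per token and the n-gram order counts tokens instead of words:

```python
# Sketch: remap a text corpus to one unicode char per BPE token before
# training KenLM. Paths and the offset value are placeholders.
import sentencepiece as spm

TOKEN_OFFSET = 100  # shift ids past control/whitespace code points

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def encode_line(line: str) -> str:
    """One space-separated unicode char per BPE token id."""
    ids = sp.encode(line.strip(), out_type=int)
    return " ".join(chr(i + TOKEN_OFFSET) for i in ids)

with open("corpus.txt") as src, open("corpus.tokens.txt", "w") as dst:
    for line in src:
        dst.write(encode_line(line) + "\n")

# Then train KenLM on the remapped corpus as usual, e.g.:
#   lmplz -o 6 --discount_fallback < corpus.tokens.txt > token_lm.arpa
#   build_binary token_lm.arpa token_lm.bin
```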

🎯 My word-level 6-gram model is 1.2 GB, while the unicode token-based 6-gram model is 300 MB and performs better in both speed and quality.

I've managed to get it working with pyctcdecode. There's another hack here: I'm passing an alphabet with a "_" symbol prepended to all the unicode chars, so the decoder treats them as separate words and does the proper rescoring. A sketch of the wiring is below.
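Concretely, something like this (a minimal sketch under my assumptions: I'm writing the prefix as the SentencePiece-style "▁" word-boundary marker that pyctcdecode recognizes, and the paths and offset must match the LM training step above):

```python
# Sketch: build a pyctcdecode decoder over the remapped alphabet.
# Prefixing every label with "▁" makes each token look like its own
# "word", so the token-level KenLM rescores it directly.
import sentencepiece as spm
from pyctcdecode import build_ctcdecoder

TOKEN_OFFSET = 100  # must match the offset used to prepare the LM corpus
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

labels = ["▁" + chr(i + TOKEN_OFFSET) for i in range(sp.get_piece_size())]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="token_lm.bin",  # binary LM from the previous step
)

# logits: (time, vocab) array of CTC log-probs from the acoustic model
# encoded = decoder.decode(logits)
# Map the unicode chars back to token ids, then detokenize:
# ids = [ord(c) - TOKEN_OFFSET for c in encoded if not c.isspace()]
# text = sp.decode(ids)
```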

🔭 On the other hand, hotword support is broken by this trick. Maybe you'll have some thoughts on this, because the trick is pretty useful.

gkucsko commented 2 years ago

Hey, thanks. We're not officially supporting this re-mapping, since it effectively turns a word-level LM into a wordpiece LM, which means less context (which is where the size reduction mostly comes from). Feel free to look into using pyctcdecode with the remapping, but unless there's an overwhelmingly good reason to support it, we probably shouldn't add code for it to the main repo.

averkij commented 2 years ago

Ok, I've got it.