kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

KenLM with unicode hack (decreased LM size) #60

Closed · averkij closed this issue 2 years ago

averkij commented 2 years ago

👋 Hello! Maybe this will be useful for the community. It's about how to shrink your model size several-fold.

🔨 I've worked with different CTC decoders and used the one from the NeMo repo. Specifically for it, I trained KenLM models with these scripts. NeMo's ctc_decoder doesn't support BPE tokenization, so they developed a trick: encode each BPE token as a single unicode character and train the n-gram model on those characters instead of words.
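For context, here is a minimal sketch of what such a token-to-unicode remapping can look like. The file names, the sentencepiece tokenizer, and the codepoint offset are all illustrative assumptions, not the exact NeMo scripts:

```python
# Sketch: remap each BPE token id to a single unicode character so KenLM
# sees one "word" per token and the n-gram order counts tokens, not words.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # hypothetical path
OFFSET = 0x4E00  # illustrative offset into a dense printable unicode range

def encode_line(line: str) -> str:
    ids = sp.encode(line, out_type=int)
    return " ".join(chr(OFFSET + i) for i in ids)

with open("corpus.txt") as src, open("corpus.unicode.txt", "w") as dst:
    for line in src:
        dst.write(encode_line(line.strip()) + "\n")

# Then train KenLM on the remapped corpus as usual, e.g.:
#   lmplz -o 6 < corpus.unicode.txt > lm.arpa
#   build_binary lm.arpa lm.bin
```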

🎯 My 6-gram word-level model is 1.2 GB, while the unicode-encoded 6-gram token-based model is 300 MB and performs better in both speed and quality.

I've managed to get it working with pyctcdecode. There's another hack here: I pass an alphabet with a "_" symbol prefixed to every unicode character, so the decoder treats each of them as a separate word and does the proper rescoring.
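A sketch of how that alphabet might be wired into pyctcdecode's real `build_ctcdecoder` API. The offset and vocabulary size are the same illustrative assumptions as above; I use "▁" (U+2581), pyctcdecode's BPE word-boundary marker, which I assume is what the "_" in the issue refers to:

```python
# Continuing the sketch above: wire the remapped vocabulary into pyctcdecode.
from pyctcdecode import build_ctcdecoder

OFFSET = 0x4E00    # same illustrative offset as in the corpus remapping
VOCAB_SIZE = 1024  # hypothetical BPE vocabulary size

# Prefixing each token character with the word-boundary marker makes the
# decoder treat every token as a standalone word, so LM rescoring queries
# the token-level KenLM at each boundary.
labels = ["▁" + chr(OFFSET + i) for i in range(VOCAB_SIZE)]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.bin",  # token-level model trained on the remapped corpus
)
# text = decoder.decode(logits)  # logits: (time, len(labels)) log-probs
# The decoded string must then be mapped back through the tokenizer.
```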

🔭 On the other hand, hotword support is broken by this trick. Maybe you'll have some thoughts on this, because the trick is pretty useful.
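The breakage makes sense: hotwords are matched against decoded words at rescoring time, and under the remapping every decoded "word" is a single token character, so a plain-text hotword can never match. One speculative, untested workaround would be to push the hotword through the same mapping (reusing `sp`, `OFFSET`, and `decoder` from the sketches above):

```python
# Speculative sketch: encode a hotword phrase the same way as the corpus,
# so its "words" are the same single-token characters the decoder emits.
def encode_hotword(phrase: str) -> str:
    ids = sp.encode(phrase, out_type=int)
    return " ".join(chr(OFFSET + i) for i in ids)

text = decoder.decode(
    logits,                               # (time, len(labels)) log-probs
    hotwords=[encode_hotword("kensho")],  # hypothetical hotword
    hotword_weight=10.0,
)
```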

gkucsko commented 2 years ago

Hey, thanks. We are not officially supporting this re-mapping, since it effectively turns a word-level LM into a wordpiece LM, which means less context (which is where the size reduction mostly comes from). Feel free to have a look into using pyctcdecode with the remapping, but unless there is an overwhelmingly good reason to support it, we probably shouldn't add code for it to the main repo.

averkij commented 2 years ago

Ok, I've got it.