kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

support for NeMo KenLM BPE encoded files #30

Closed jedzill4 closed 2 years ago

jedzill4 commented 2 years ago

Hi! Following the NeMo tutorial, I trained a KenLM n-gram model on BPE tokens. In the context of NeMo, beam search decoding (the OpenSeq2Seq CTC decoder) only supports character-level LMs. To handle BPE-based tokenizers (i.e. sentencepiece), they encode the BPE token ids as unicode characters from a standard unicode table, starting at a fixed TOKEN_OFFSET. This trick makes the arpa/binary files much lighter, but as a trade-off the decoder needs to map the characters back to the correct BPE ids. In the current implementation, pyctcdecode tags these encoded BPE tokens as unknown, so in practice the result is the same as not using an LM at all. I am struggling to understand both implementations (NeMo's and yours) well enough to write a possible solution. Do you have any idea where I should look?
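For context, the encoding trick is roughly the following (a minimal sketch; `TOKEN_OFFSET = 100` mirrors NeMo's `DEFAULT_TOKEN_OFFSET`, but treat the exact value and function names as assumptions):

```python
# Sketch of the NeMo-style BPE-to-unicode encoding described above.
# Each BPE token id becomes a single unicode character at a fixed
# offset, so a character-level n-gram LM and decoder can be reused.
TOKEN_OFFSET = 100  # assumed value, mirroring NeMo's DEFAULT_TOKEN_OFFSET

def encode_token_ids(token_ids):
    """Map BPE token ids to single unicode characters."""
    return "".join(chr(i + TOKEN_OFFSET) for i in token_ids)

def decode_token_ids(encoded):
    """Invert the mapping: characters back to BPE token ids."""
    return [ord(c) - TOKEN_OFFSET for c in encoded]

# Round trip: three token ids become a three-character "word" and back.
assert decode_token_ids(encode_token_ids([5, 17, 302])) == [5, 17, 302]
```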

Great work btw!

Resources:

gkucsko commented 2 years ago

Hey, thanks for having a look. The main differences with regard to BPE handling are:

- NeMo encodes each BPE token as a single unicode character (shifted by TOKEN_OFFSET) and trains a character-level KenLM on those characters, so an existing character-level decoder can be reused unchanged.
- pyctcdecode keeps the KenLM at the word level and merges the BPE pieces back into full words inside the decoder before scoring them against the LM.

So the advantage of the NeMo approach is that you can re-use an existing decoder implementation, and the advantage of pyctcdecode is that the LM remains word-based and therefore has a longer n-gram context. Also, you can use the same LM for BPE and non-BPE models because the decoder takes care of the merging. Does that make sense?
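To illustrate the word-based approach, here is a sketch of typical pyctcdecode usage with a BPE vocabulary (the vocabulary, LM path, and logits are placeholders; pyctcdecode recognizes the sentencepiece-style "▁" word-boundary marker in the labels):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Placeholder BPE vocabulary in acoustic-model output order.
# "" is the CTC blank; pieces starting with "▁" begin a new word,
# and the decoder merges pieces into words before querying the LM.
labels = ["", "▁the", "▁cat", "▁sat", "▁on", "▁mat", "s", "at"]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="word_lm.arpa",  # placeholder: a word-level n-gram LM
)

# Fake (time, vocab) log-probabilities as a stand-in for model output.
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=50))
text = decoder.decode(logits)
```

Because the merging happens inside the decoder, the same word-level arpa file works whether the acoustic model emits characters or BPE pieces.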

gkucsko commented 2 years ago

Not sure what you mean by "pyctcdecode tags the encoded BPE tokens as unknown", could you elaborate?

gkucsko commented 2 years ago

Closing. Feel free to re-open if this is still an issue.

averkij commented 2 years ago

@jedzill4 It can be achieved if you pass the unicode-encoded alphabet with a "▁" (sentencepiece word-boundary) prefix. pyctcdecode will then treat each symbol as its own word and everything works. See issue #60.
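A sketch of that workaround, building one "▁"-prefixed label per NeMo-encoded token so the NeMo character-level KenLM can be used directly (`TOKEN_OFFSET`, `vocab_size`, and the LM path are assumptions/placeholders):

```python
from pyctcdecode import build_ctcdecoder

TOKEN_OFFSET = 100  # assumed, mirroring NeMo's DEFAULT_TOKEN_OFFSET
vocab_size = 128    # placeholder: size of the BPE vocabulary

# "" is the CTC blank. Prefixing every encoded character with "▁"
# makes pyctcdecode treat each token as a standalone word, which is
# exactly what the character-level KenLM built by NeMo expects.
labels = [""] + ["▁" + chr(i + TOKEN_OFFSET) for i in range(vocab_size)]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="nemo_bpe_lm.arpa",  # placeholder: NeMo-trained char-level LM
)
```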