Hey, thanks for having a look. The main differences with regard to BPE handling are:

- NeMo redefines BPE tokens to be the main unit of analysis, meaning an n-gram language model uses n BPE tokens as context. It works with the existing C++ decoder implementations meant for characters because the BPE tokens get remapped onto unicode characters, so it effectively behaves like character-level decoding.
- pyctcdecode merges the BPE units back into words during decoding and then uses a word-based LM for scoring, so an n-gram LM remains an n-word LM.

The advantage of the NeMo approach is that you can re-use an existing decoder implementation; the advantage of pyctcdecode is that the LM remains word-based and therefore has a longer effective n-gram context. Also, you can use the same LM for BPE and non-BPE models because the decoder takes care of the merging. Does that make sense?
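To make the merging concrete, here is a rough sketch (not pyctcdecode's actual internals) of how sentencepiece-style BPE pieces can be joined back into words during decoding, assuming the standard "▁" word-boundary marker:

```python
# Illustrative only: join sentencepiece-style BPE pieces back into words so a
# word-level n-gram LM can score complete words during beam search.
def merge_bpe_pieces(pieces):
    words, current = [], ""
    for piece in pieces:
        if piece.startswith("\u2581"):  # "▁" marks the start of a new word
            if current:
                words.append(current)
            current = piece[1:]
        else:
            current += piece
    if current:
        words.append(current)
    return words

print(merge_bpe_pieces(["\u2581hel", "lo", "\u2581wor", "ld"]))  # ['hello', 'world']
```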
Not sure what you mean by "the pyctcdecoder tag the encoded-BPE-tokens as unknown", could you elaborate?
Closing. Feel free to re-open if this is still an issue.
@jedzill4 It can be achieved if you pass the unicode-encoded alphabet with a "_" prefix. pyctcdecode will treat each symbol as a word and everything will work fine. See issue #60.
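For anyone landing here, a minimal sketch of that workaround might look like the following (the TOKEN_OFFSET value, vocabulary size, blank placement, and file path are all assumptions; adapt them to your tokenizer and to how the KenLM corpus was encoded):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

TOKEN_OFFSET = 100   # assumed: same offset used when unicode-encoding the LM corpus
VOCAB_SIZE = 128     # assumed: size of the BPE tokenizer vocabulary

# Prefix each remapped character with the word-boundary marker so pyctcdecode
# treats every BPE symbol as a standalone "word" and scores it with the LM.
labels = ["\u2581" + chr(i + TOKEN_OFFSET) for i in range(VOCAB_SIZE)]
labels.append("")  # CTC blank; assumed to be the last output index of the model

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="bpe_unicode_ngram.arpa",  # hypothetical path to the encoded LM
)

# logits: log-probability matrix of shape (time, len(labels)) from the acoustic model
logits = np.log(np.full((50, len(labels)), 1.0 / len(labels), dtype=np.float32))
encoded = decoder.decode(logits)

# Map the unicode characters back to BPE ids, then detokenize with your tokenizer.
token_ids = [ord(c) - TOKEN_OFFSET for c in encoded.replace(" ", "")]
```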
Hi! Following the NeMo tutorial, I trained a KenLM n-gram model based on BPE. In NeMo, beam search decoding only supports the character level (the OpenSeq2Seq ctc-decoder). To handle BPE-based tokenizers (i.e. sentencepiece), they encode BPE tokens as unicode characters using a standard unicode table and a TOKEN_OFFSET as the first id to use. This trick makes the arpa/binary files much lighter, but as a tradeoff the decoder needs to map the characters back to the correct BPE ids. In the current implementation, pyctcdecode tags the encoded BPE tokens as unknown, so in practice the result is the same as not using an LM at all. I am struggling to understand both implementations (NeMo's and yours) and to write a possible solution. Do you have any idea where I should look?
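For reference, the encoding trick described above boils down to something like this (a sketch only; the offset value and the tokenizer calls are assumptions, and NeMo's own scripts handle the actual corpus building):

```python
TOKEN_OFFSET = 100  # assumed first code point; NeMo defines its own offset constant

def ids_to_unicode(token_ids):
    """Map BPE token ids to single unicode characters for the KenLM corpus."""
    return "".join(chr(i + TOKEN_OFFSET) for i in token_ids)

def unicode_to_ids(encoded):
    """Invert the mapping after beam search to recover the BPE token ids."""
    return [ord(c) - TOKEN_OFFSET for c in encoded]

# e.g. with a sentencepiece tokenizer (hypothetical variable `sp`):
# line = ids_to_unicode(sp.encode("hello world"))
```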
Great work btw!
Resources: