Hey, thanks for having a look. The main differences with regard to BPE handling are:

- NeMo redefines BPE tokens to be the main unit of analysis, meaning an n-gram language model uses n BPE tokens as context. It works with the existing C++ decoder implementations meant for characters because the BPE tokens get remapped onto unicode characters, so it effectively behaves like character-level decoding.
- pyctcdecode merges the BPE units back into words during decoding and then uses a word-based LM for scoring, so an n-gram LM remains an n-word LM.

The advantage of the NeMo approach is that you can re-use an existing decoder implementation; the advantage of pyctcdecode is that the LM remains word-based and therefore has a longer effective n-gram context. Also, you can use the same LM for BPE and non-BPE models because the decoder takes care of the merging. Does that make sense?
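To make the merging concrete, here is a rough sketch (not pyctcdecode's actual internals) of how sentencepiece-style BPE pieces can be joined back into words during decoding, assuming the standard "▁" word-boundary marker:

```python
# Illustrative only: join sentencepiece-style BPE pieces back into words so a
# word-level n-gram LM can score complete words during beam search.
def merge_bpe_pieces(pieces):
    words, current = [], ""
    for piece in pieces:
        if piece.startswith("\u2581"):  # "▁" marks the start of a new word
            if current:
                words.append(current)
            current = piece[1:]
        else:
            current += piece
    if current:
        words.append(current)
    return words

print(merge_bpe_pieces(["\u2581hel", "lo", "\u2581wor", "ld"]))  # ['hello', 'world']
```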
Not sure what you mean by "the pyctcdecoder tag the encoded-BPE-tokens as unknown", could you elaborate?
Closing. Feel free to re-open if this is still an issue.
@jedzill4 It can be achieved if you pass the unicode-encoded alphabet with a "_" prefix. pyctcdecode will treat each symbol as a word and everything will work fine. See issue #60.
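For anyone landing here, a minimal sketch of that workaround might look like the following (the TOKEN_OFFSET value, vocabulary size, blank placement, and file path are all assumptions; adapt them to your tokenizer and to how the KenLM corpus was encoded):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

TOKEN_OFFSET = 100   # assumed: same offset used when unicode-encoding the LM corpus
VOCAB_SIZE = 128     # assumed: size of the BPE tokenizer vocabulary

# Prefix each remapped character with the word-boundary marker so pyctcdecode
# treats every BPE symbol as a standalone "word" and scores it with the LM.
labels = ["\u2581" + chr(i + TOKEN_OFFSET) for i in range(VOCAB_SIZE)]
labels.append("")  # CTC blank; assumed to be the last output index of the model

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="bpe_unicode_ngram.arpa",  # hypothetical path to the encoded LM
)

# logits: log-probability matrix of shape (time, len(labels)) from the acoustic model
logits = np.log(np.full((50, len(labels)), 1.0 / len(labels), dtype=np.float32))
encoded = decoder.decode(logits)

# Map the unicode characters back to BPE ids, then detokenize with your tokenizer.
token_ids = [ord(c) - TOKEN_OFFSET for c in encoded.replace(" ", "")]
```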
Hi! Following the NeMo tutorial, I trained a KenLM n-gram model based on BPE. In NeMo, beam search decoding only supports the character level (the OpenSeq2Seq ctc-decoder). To handle BPE-based tokenizers (i.e. sentencepiece), they encode BPE tokens as unicode characters using a standard unicode table and a TOKEN_OFFSET as the first id to use. This trick makes the arpa/binary files much lighter, but as a tradeoff the decoder needs to map the characters back to the correct BPE ids. In the current implementation, pyctcdecode tags the encoded BPE tokens as unknown, so in practice the result is the same as not using an LM at all. I am struggling to understand both implementations (NeMo's and yours) and to write a possible solution. Do you have any idea where I should look?
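For reference, the encoding trick described above boils down to something like this (a sketch only; the offset value and the tokenizer calls are assumptions, and NeMo's own scripts handle the actual corpus building):

```python
TOKEN_OFFSET = 100  # assumed first code point; NeMo defines its own offset constant

def ids_to_unicode(token_ids):
    """Map BPE token ids to single unicode characters for the KenLM corpus."""
    return "".join(chr(i + TOKEN_OFFSET) for i in token_ids)

def unicode_to_ids(encoded):
    """Invert the mapping after beam search to recover the BPE token ids."""
    return [ord(c) - TOKEN_OFFSET for c in encoded]

# e.g. with a sentencepiece tokenizer (hypothetical variable `sp`):
# line = ids_to_unicode(sp.encode("hello world"))
```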
Great work btw!
Resources: