Closed usmanfarooq619 closed 3 years ago
Hey, this looks like an OOV issue. Since kenlm doesn't have a notion of partial tokens, it helps a lot to pass a list of known unigrams (basically all the words you expect to appear). pyctcdecode can then build a character trie under the hood, which can be probed very efficiently during decoding, and OOV words will be downweighted as soon as they appear (rather than only indirectly through the LM once a space appears).
Let me know if adding the unigrams during instantiation of the LM fixes it, or whether we should look into it more carefully.
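To make the trie idea concrete, here is a minimal sketch (not pyctcdecode's actual implementation) of how a character trie built from known unigrams lets a decoder notice, mid-word, that a partial hypothesis can no longer grow into any known word, so it can be downweighted immediately instead of waiting for the next space:

```python
def build_trie(unigrams):
    """Nested-dict character trie over the known vocabulary."""
    root = {}
    for word in unigrams:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def is_known_prefix(trie, partial):
    """True if `partial` can still be extended into at least one known word."""
    node = trie
    for ch in partial:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["hello", "help", "world"])
print(is_known_prefix(trie, "hel"))   # True  -> keep scoring normally
print(is_known_prefix(trie, "helz"))  # False -> apply the OOV penalty now
```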
That just sounds like a hotfix to me, tbh. By your reasoning, the same thing should happen with the other implementation as well.
It could be that you can reproduce the behavior of the other decoder by lowering some of the thresholds during prediction, for example beam_prune_logp=-20 and token_min_logp=-8. The default parameters are tuned for high speed at similar accuracy in cases where both an LM and unigrams are present (the decoder prunes hypotheses continuously during decoding to minimize the number of beam proposals and maximize speed, so it can compete with C++ implementations).
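A hedged sketch of the two thresholds mentioned above (the names mirror pyctcdecode's parameters, but the logic here is deliberately simplified): token_min_logp drops characters whose per-frame log-probability is too low to bother extending beams with, and beam_prune_logp drops beams that trail the current best beam by more than the given margin:

```python
def prune_tokens(frame_logp, token_min_logp=-5.0):
    """Keep only (token, logp) pairs at or above the per-frame threshold."""
    return [(tok, lp) for tok, lp in frame_logp.items() if lp >= token_min_logp]

def prune_beams(beams, beam_prune_logp=-10.0):
    """Drop beams scoring worse than best_score + beam_prune_logp."""
    best = max(score for _, score in beams)
    return [(text, score) for text, score in beams if score >= best + beam_prune_logp]

frame = {"a": -0.1, "b": -3.0, "c": -9.0}
print(prune_tokens(frame))  # "c" falls below the threshold and is pruned

beams = [("hello", -2.0), ("hallo", -7.0), ("xqzt", -15.0)]
print(prune_beams(beams))   # "xqzt" trails the best beam too far and is pruned
```

Loosening the thresholds (more negative values) keeps more hypotheses alive at the cost of speed, which is why lowering them may reproduce the other decoder's behavior.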
Is there a way you can check this on public data, or share something with us so that I can try to reproduce it? We usually have access to a unigram list since we train the LM ourselves, but I'm very interested to hear if that's different for you, and happy to work on optimizing the scenario where no unigrams are provided. It could also of course be that there is some other sneaky bug here that has nothing to do with unigrams, but it's hard for me to test without being able to reproduce it.
If 'no unigrams available' is an important scenario for you, that's great to know. I'm pretty sure the word concatenation can be fixed by adjusting how we score partial words (which is usually done with the trie built from the unigram list).
Oh, looking at the other repo, I believe it reads the unigrams from the ARPA kenlm file, so you don't have to provide the list separately. However, then you can't use a binary kenlm file for instantiation, since as far as I know binary files don't store the n-grams in an easy-to-read format.
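For illustration, here is a hypothetical helper (not code from either repo) showing why a text ARPA file makes this easy: the unigrams are listed verbatim in its "\1-grams:" section, one per line as logprob, word, and optional backoff, so the vocabulary can be recovered with a simple parse:

```python
import io

def unigrams_from_arpa(fileobj):
    """Collect the words from the \\1-grams: section of a text ARPA file."""
    words = []
    in_unigrams = False
    for line in fileobj:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):  # blank line or next section
                break
            # each entry: logprob <tab> word [<tab> backoff]
            words.append(line.split("\t")[1])
    return words

sample = io.StringIO(
    "\\data\\\n"
    "ngram 1=3\n"
    "\n"
    "\\1-grams:\n"
    "-1.0\t<s>\n"
    "-2.3\thello\t-0.5\n"
    "-2.7\tworld\n"
    "\n"
    "\\2-grams:\n"
)
print(unigrams_from_arpa(sample))  # ['<s>', 'hello', 'world']
```

A binary kenlm file stores the same n-grams in a packed probing or trie structure, so there is no comparably simple way to walk the vocabulary.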
Okay, so I also tried the unigram option and it fixes the problem for now, but this probably won't work for, e.g., new names that we've never seen before. Let me see if I can provide you the logits.
As for the other implementation, I was using the plain ARPA file, so it might be the case you suggested, but I don't think the binary one should be that big of a problem.
Also, how about a simple extension to enable support for transformer-based LM scorers: https://github.com/simonepri/lm-scorer
Great, I will put it on my todo list to provide better support around the unigram list. New words are always tricky: if they are not in your kenlm model, they will be scored as OOV anyway, and the likelihood of that is tunable with a separate parameter. Another good way to deal with new words (that you know about, but that aren't yet in your language model) is to provide them as 'hotwords'. That way they get compiled into their own scorer and become more likely to be transcribed; you can also tune their weight to decide how important they are. As for neural LM models, it's definitely something we're curious about if people are interested. Is your main goal to get better results while accepting slower output? Have you tried applying one in a beam re-scoring manner after getting the full outputs, to see if it improves your results?
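The beam re-scoring idea mentioned above can be sketched in a few lines: take the n-best beams from the fast decoder and re-rank them with a stronger, slower scorer. `neural_lm_score` here is a stand-in for any sentence-level log-probability function (e.g. a transformer LM), and `lm_weight` is a hypothetical mixing weight, not a pyctcdecode parameter:

```python
def rescore_beams(beams, neural_lm_score, lm_weight=0.5):
    """Re-rank (text, acoustic_logp) beams by a combined score, best first."""
    rescored = [
        (text, acoustic + lm_weight * neural_lm_score(text))
        for text, acoustic in beams
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# toy scorer that just prefers shorter sentences; a real one would be an LM
toy_lm = lambda text: -0.1 * len(text.split())

beams = [("i scream for ice cream", -4.0), ("ice cream for ice cream", -3.9)]
print(rescore_beams(beams, toy_lm))
```

As noted in the thread, this only helps if the correct hypothesis survives into the n-best list in the first place, which is the user's objection to pure re-scoring.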
I've tried it in a beam re-scoring manner, but that assumes you get good beam predictions to begin with. So I think it'd be wonderful if you could patch that in. Obviously one can use the very fast and light transformers to get a better middle ground. Looking forward to the fix as well as the new feature. Thank you for the wonderful work!
Sounds great, will have a look. Thanks for the feedback, and let me know if other issues come up!
See PR #4 for some additional warnings around unigrams, as well as improved partial-word scoring without a trie, which should help with the word-concatenation issue.
I am trying to use the CTC decoding feature with kenlm on logits from HuggingFace's wav2vec2, which returns the following output with beam size 64:
while previously, when decoding with https://github.com/ynop/py-ctc-decode with the same LM and parameters, I was getting:
I don't understand why the words are being concatenated together. Do you have any thoughts?