kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Transcription being concatenated oddly #5

Closed: usmanfarooq619 closed this 3 years ago

usmanfarooq619 commented 3 years ago

I am trying to use the CTC decoding feature with kenlm on logits from HuggingFace's wav2vec2:

import kenlm
from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel

kenlm_model = kenlm.Model("my_lm.binary")  # my kenlm model

vocab = ['l', 'z', 'u', 'k', 'f', 'r', 'g', 'i', 'v', 's', 'o', 'b', 'w', 'e', 'd', 'n', 'y', 'c', 'q', 'p', 'h', 't', 'a', 'x', ' ', 'j', 'm', '⁇', '', '⁇', '⁇']
alphabet = Alphabet.build_alphabet(vocab, ctc_token_idx=-3)
# language model
lm = LanguageModel(kenlm_model, alpha=0.169, beta=0.055)
# build the decoder and decode the logits
decoder = BeamSearchDecoderCTC(alphabet, lm)
text = decoder.decode(logits, beam_width=64)  # logits from wav2vec2

which returns the following output with beam size 64:

yeah jon okay i m calling from the clinic the family doctor clinessegryand this number six four five five one three o five

whereas when I was previously decoding with https://github.com/ynop/py-ctc-decode, with the same LM and parameters, I was getting:

yeah on okay i am calling from the clinic the family dot clinic try and this number six four five five one three o five

I don't understand why the words are being concatenated together. Do you have any thoughts?

gkucsko commented 3 years ago

Hey, this looks like an OOV issue. Since kenlm doesn't have a notion of partial tokens, it helps a lot to pass a list of known unigrams (basically all the words you expect to appear). pyctcdecode can then, under the hood, build a character trie, which can be probed very efficiently during decoding, and OOV words will be downweighted as soon as they appear (rather than only indirectly through the LM once a space appears). Let me know if adding the unigrams during instantiation of the LM fixes it, or whether we should look into it more carefully.
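
For example, something like this (the unigrams.txt path is just a placeholder for wherever your word list lives):

# pass known unigrams so pyctcdecode can build a character trie
import kenlm
from pyctcdecode import LanguageModel

kenlm_model = kenlm.Model("my_lm.binary")  # placeholder path
with open("unigrams.txt") as f:  # one word per line, e.g. from the LM training corpus
    unigrams = [line.strip() for line in f]

lm = LanguageModel(kenlm_model, unigrams=unigrams, alpha=0.169, beta=0.055)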

sarim-zafar commented 3 years ago

That just sounds like a hotfix to me, tbh. According to your reasoning, the same should happen with the other implementation as well.

gkucsko commented 3 years ago

It could be that you can reproduce the behavior of the other decoder by lowering some of the thresholds during prediction, for example beam_prune_logp=-20 and token_min_logp=-8. The default parameters are tuned for high speed at similar accuracy in cases where both an LM and unigrams are present (the decoder continuously prunes hypotheses during decoding to minimize the number of beam proposals and maximize speed, so as to compete with C++ implementations).

Is there a way you can check this on public data, or share something with us, so that I can try to reproduce? We usually have access to a unigram list since we train the LM ourselves, but I'm very interested to hear if that's different for you, and happy to work on optimizing the scenario where no unigrams are provided. It could of course also be some other sneaky bug that has nothing to do with unigrams, but it's hard for me to test without being able to reproduce it.
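
For reference, lowering those thresholds would look something like this (with logits being the same wav2vec2 output as above):

# relax the pruning thresholds; slower, but closer to an exhaustive search
text = decoder.decode(
    logits,
    beam_width=64,
    beam_prune_logp=-20.0,
    token_min_logp=-8.0,
)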

gkucsko commented 3 years ago

If 'no unigrams available' is an important scenario for you, that's good to know. I'm pretty sure the word concatenation can be fixed by adjusting how we score partial words (which is usually done via the trie built from the unigram list).

gkucsko commented 3 years ago

Oh, looking at the other repo, I believe it reads the unigrams from the ARPA kenlm file? Then you don't have to provide the list separately. However, you then can't use a binary kenlm file for instantiation, since as far as I know binary files don't store the n-grams in an easy-to-read format.
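
For what it's worth, extracting the unigrams from an ARPA file yourself is straightforward, something like this rough sketch (the path is a placeholder):

# pull the word list out of the \1-grams: section of an ARPA file
unigrams = []
with open("my_lm.arpa") as f:
    in_unigrams = False
    for line in f:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):  # blank line or next section ends the block
                break
            fields = line.split("\t")  # format: logprob <tab> word [<tab> backoff]
            if len(fields) >= 2:
                unigrams.append(fields[1])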

sarim-zafar commented 3 years ago

Okay, so I also tried the unigram option, and it fixes the problem for now, but this probably won't work for, say, new names that we've never seen before. Let me see if I can provide you with the logits.

As for the other implementation, I was using a plain ARPA file, so it might be as you suggested, though I don't think the binary one should be that big of a problem.

Also, how about a simple extension to enable support for transformer-based LM scorers: https://github.com/simonepri/lm-scorer

gkucsko commented 3 years ago

Great, I will put it on my todo list to provide better support around the unigram list. New words are always tricky, because if they are not in your kenlm model, they will be scored as OOV anyway; you can tune how strongly that penalizes them with a separate parameter. Another good way to deal with new words (that you know about, but that are not yet in your language model) is to provide them as 'hotwords'. That way they get compiled into their own scorer, which increases their likelihood of being transcribed, and you can tune their weight to decide how important they are.

In regards to neural LM models, it's definitely something we are curious about if people are interested. Is your main goal to get better results while accepting slower output? Have you tried applying it in a beam re-scoring manner after getting the full outputs, to see if you can improve your results?
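
Using hotwords looks something like this (the word list here is just an example):

# boost specific words at decode time via a separate hotword scorer
text = decoder.decode(
    logits,
    hotwords=["clinic", "some new name"],  # example words to boost
    hotword_weight=10.0,  # tune to control how strongly they are favored
)

And for re-scoring, decode_beams returns the n-best list, where the first element of each beam is the transcript, so you can re-rank with any external scorer (my_neural_lm_score here is a stand-in for whatever neural LM you use):

beams = decoder.decode_beams(logits, beam_width=64)
candidates = [beam[0] for beam in beams]
best = max(candidates, key=my_neural_lm_score)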

sarim-zafar commented 3 years ago

I've tried it in a beam re-scoring manner, but that assumes you get good beam predictions to begin with. So I think it'd be wonderful if you could patch that in. Obviously one could use the insanely fast and light transformers to strike a better middle ground. Looking forward to the fix as well as the new feature. Thank you for the wonderful work!

gkucsko commented 3 years ago

Sounds great, will have a look. Thanks for the feedback, and let me know if other issues come up!

gkucsko commented 3 years ago

See PR #4 for some additional warnings around unigrams, as well as improved partial scoring without a trie, which should help with the word concatenation.