kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

confidence scores output from the LM #57

Open Jiltseb opened 2 years ago

Jiltseb commented 2 years ago

Is there a way to also get confidence scores (word/sub-word level) as output? With decode_beams it is possible to get the time information for alignment purposes and the KenLM state, in addition to the segment-level probabilities. It would be a nice addition if word-level confidence scores were also exposed. Since these are calculated from the AM and LM (and optionally hotwords), we could do fine-grained analysis at the word level to remove or emphasize certain words, as desired.
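
For reference, a minimal sketch of what decode_beams currently returns, assuming the beam tuple layout from the pyctcdecode README (`labels`, `logits`, and `lm.arpa` are placeholder inputs):

```python
from pyctcdecode import build_ctcdecoder

# `labels` is the CTC vocabulary and `logits` a (time, vocab) matrix of
# log-probabilities from the acoustic model; both are placeholders here
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa")
beams = decoder.decode_beams(logits)

# each beam: (text, kenlm_state, text_frames, logit_score, lm_score),
# where text_frames is a list of (word, (start_frame, end_frame)) pairs
text, kenlm_state, text_frames, logit_score, lm_score = beams[0]
```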

gkucsko commented 2 years ago

Hi, thanks for the question. For the AM we decided not to include confidences out of the box, since there is no unique way to calculate them. Using the frame-level annotations and averaging the probabilities (or similar) is probably the best bet here. Respecting the LM and hotwords gets a bit more complicated, since neither is really normalized in a good way, and the right approach would probably depend heavily on the downstream task. Open to suggestions, though, if you have a strong use case.
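
As a minimal sketch of that averaging idea, assuming `logits` holds frame-level log-probabilities and `text_frames` comes from decode_beams (taking the top token probability per frame is just one reasonable proxy, not the library's official method):

```python
import numpy as np

# per-frame probabilities; assumes `logits` are log-probabilities
probs = np.exp(logits)

# average the most-probable-token probability over each word's frame span
word_confidences = [
    (word, float(probs[start:end].max(axis=1).mean()))
    for word, (start, end) in text_frames
]
```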

Jiltseb commented 2 years ago

Hi @gkucsko, thank you very much for your reply. I can get the confidence from the e2e AM by averaging the frame-level probabilities, as you mentioned. But with the LM, knowing the confidence with which a word is predicted could shed light on the contribution of the LM (not just perplexity) and help us decide whether a particular word is suitable for further processing in SLU tasks. If the contributions of the individual modules can be segregated at the word level, there should be a way to trace the individual word confidences back from the top beam.

patrickvonplaten commented 2 years ago

I'd also be very interested in this addition!

I think it should be relatively easy to additionally return the lm_score + am_score that pyctcdecode gives each word, no? Not sure if I understand the code 100%, but this line here: https://github.com/kensho-technologies/pyctcdecode/blob/9071d5091387579b4722cfcbe0c8597ad0b16227/pyctcdecode/decoder.py#L326 defines the combined lm_score + am_score probability that pyctcdecode assigns, no?

The am_score corresponds to logit_score, and if I understand correctly this is just \sum_{i=word_start}^{word_end} log(logit[i]), while lm_score is the language model score returned by KenLM, weighted by alpha and beta, no? So if we could just save those scores in some kind of list, that would be very helpful IMO.
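
In the meantime, one way to inspect per-word LM contributions outside of pyctcdecode is to rescore the final transcript with KenLM directly. A hedged sketch ("lm.arpa" is a placeholder path; alpha and beta are pyctcdecode's defaults, and the exact weighting inside pyctcdecode also involves a log-base conversion):

```python
import kenlm

lm = kenlm.Model("lm.arpa")  # placeholder path
alpha, beta = 0.5, 1.5       # pyctcdecode's default LM weights

# full_scores yields (log10 prob, ngram length, oov flag) per word,
# plus an entry for the end-of-sentence token that zip() drops here
for word, (log10_prob, ngram_len, oov) in zip(text.split(), lm.full_scores(text)):
    print(word, alpha * log10_prob + beta, "oov" if oov else "")
```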

What do you think @gkucsko ?

patrickvonplaten commented 2 years ago

Also cc @lopez86 :-)

patrickvonplaten commented 2 years ago

The main problem with using lm_score (which is already returned here: https://github.com/kensho-technologies/pyctcdecode/blob/9071d5091387579b4722cfcbe0c8597ad0b16227/pyctcdecode/decoder.py#L498) for confidence scoring is that the score is not normalized by length at all, e.g. a longer transcription will necessarily have a lower lm_score. One could normalize the score by the number of words, but I wonder whether it's better to take the minimum over the word scores, as described here.
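
A sketch of both options, assuming `text`, `lm_score`, and `word_confidences` from the snippets above:

```python
# option 1: length-normalize the utterance-level lm_score
n_words = max(len(text.split()), 1)
mean_lm_score = lm_score / n_words

# option 2: take the weakest word as a pessimistic utterance confidence
min_word_confidence = min(conf for _, conf in word_confidences)
```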

Also related: https://discuss.huggingface.co/t/confidence-scores-self-training-for-wav2vec2-ctc-models-with-lm-pyctcdecode/17052