lumaku / ctc-segmentation

Segment an audio file and obtain utterance alignments. (Python package)
Apache License 2.0

Is there any scale for confidence scores? #31

Closed Aditya3107 closed 11 months ago

Aditya3107 commented 1 year ago

@lumaku Thank you so much for your incredible work. I've been working on obtaining word-level confidence scores for different ASR tools, aiming to compare Whisper and Wav2vec2 models. However, I've noticed some differences in the confidence scores between the two models. For example:

With Wav2vec2 using the CTC-Segmentation algorithm, I obtained the following word-level confidence: ["cómo":0.000, "puedo":0.646, "ayudarte":0.455]

Using Whisper with DTW (via the Whisper-Timestamped library), I obtained the following word-level confidence: ["¿Cómo":0.869, "puedo":0.998, "ayudarte?":0.999]

I understand that Wav2vec2 and Whisper have distinct architectures, with Whisper not being trained with the CTC loss, which makes a direct comparison challenging. Is there a method or approach I can use to ensure a meaningful comparison of word-level confidence scores between these two ASRs? It would be great if you could guide me on this.

lumaku commented 1 year ago

The theoretical basis for both CTC and DTW is described using HMMs. Both approaches calculate the probability that a Hidden Markov Model produces a given sequence.

CTC is conditioned to produce HMM-like state probabilities that you can directly use to estimate the probability of a hypothesis label sequence. Here, the CTC segmentation confidence score is not suitable for your use case, as it is built on top of these HMM label probabilities, uses a worst-case metric, and is designed to "catch" label-data mismatches in longer sentences. I would therefore use the CTC output probabilities directly (in the variable lpz) to estimate the production probability. To obtain the probability of a word, cut out the segment that contains the word and calculate the word production probability over that segment with the HMM (forward) formula.
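
To make this concrete, here is a minimal sketch of what such a calculation could look like, assuming `lpz` holds per-frame log-softmax CTC outputs of shape `[T, vocab]` (apply a log-softmax first if it contains raw logits), `blank_id` is the CTC blank index, and `start_frame`/`end_frame` are the word's frame offsets derived from the segmentation timings; all of these names are illustrative and not part of the package API:

```python
import numpy as np

def ctc_forward_logprob(lpz_segment, token_ids, blank_id=0):
    """Log-probability that the CTC output in this segment produces the given
    token sequence, summed over all valid alignments (the HMM forward /
    "production" probability)."""
    T = lpz_segment.shape[0]
    # Extended label sequence with blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank_id]
    for tok in token_ids:
        ext.extend([tok, blank_id])
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = lpz_segment[0, ext[0]]
    if S > 1:
        alpha[0, 1] = lpz_segment[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]
            if s > 0:
                candidates.append(alpha[t - 1, s - 1])
            # The skip transition is allowed unless the current label is blank
            # or repeats the label two positions back.
            if s > 1 and ext[s] != blank_id and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(candidates) + lpz_segment[t, ext[s]]

    # Valid endings: last blank or last non-blank label.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2]) if S > 1 else alpha[-1, -1]

# Cut the word's frames out of the full CTC output and score the word:
# word_logprob = ctc_forward_logprob(lpz[start_frame:end_frame], word_token_ids, blank_id)
```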

Please note that the output probabilities strongly depend on the model and the architecture. Networks with a better CTC output align more accurately; for example, compared to RNN models, the output neuron activations of Transformers exhibit stronger "spikes".

Aditya3107 commented 1 year ago

Thank you for the detailed answer, @lumaku. From what I understand, the confidence measure in CTC segmentation indicates how well the audio aligns with the transcript, and it's not a measure of word confidence.

Regarding the lpz variable, it contains logits with the shape [number of time steps (frames) in the input audio, vocabulary size]. If I'm correct, we can easily cut out words from the sequence using start/end word offsets and then extract the probabilities of the phonemes over these offsets. Is there a specific reason for using the HMM formula to calculate the word production probability? Would it be acceptable to simply take the mean of the phoneme probabilities that form the word, while discarding probabilities associated with [PAD] tokens?

Also, you mentioned that output probabilities depend on the model and architecture. Do you think it's a good idea to compare output probabilities if both ASRs use transformer architectures? For example, wav2vec 2.0 uses an encoder-only transformer, while Whisper uses an encoder-decoder architecture.

lumaku commented 1 year ago

Your understanding is correct. Simply using the mean value of the token probabilities would not account for the correct sequential ordering. It would be better to calculate the forward probability from the CTC token (state) probabilities. The forward probability can be read out from the trellis diagram; a simple implementation can be found in the PyTorch alignment tutorial.
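
As an illustration of the difference, here is a small sketch that contrasts the naive mean with a per-frame-normalized forward score, reusing the hypothetical `ctc_forward_logprob` from the sketch above; the function names and the per-frame normalization are my own assumptions, not part of the package or the tutorial:

```python
import numpy as np

def mean_token_confidence(lpz_segment, blank_id=0):
    """Naive score: average probability of the most likely non-blank token in
    each frame. Ignores the order in which the tokens appear."""
    probs = np.exp(lpz_segment)                  # back to probabilities
    best = probs.argmax(axis=-1)                 # most likely token per frame
    keep = best != blank_id                      # drop blank/[PAD] frames
    return float(probs[keep, best[keep]].mean()) if keep.any() else 0.0

def forward_word_confidence(lpz_segment, word_token_ids, blank_id=0):
    """Forward score: probability that the segment produces exactly this token
    sequence, normalized per frame so words of different lengths stay comparable."""
    logp = ctc_forward_logprob(lpz_segment, word_token_ids, blank_id)
    return float(np.exp(logp / max(1, lpz_segment.shape[0])))
```

The first value can be high even if the frames could never be chained into the word in the right order; the forward score only credits alignments that spell out the word in sequence.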

Whether you can use or compare wav2vec 2.0 and Whisper ASR in this way depends on your use case and on how accurate your probability estimation needs to be. Attention-based estimation of word probabilities yields values that rest on different assumptions: attention decoders are trained to produce token sequences. The Whisper model uses a sequential attention decoder that only predicts the probability of the next token in the sequence. The attention decoder has also learned sequential token probabilities, so it additionally serves as a language model, not just as an acoustic model. Estimated attention decoder probabilities therefore include the probability of token occurrence, not just its acoustic pronunciation. You could try to normalize by a language model probability to eliminate this part, similar to conventional hybrid HMM/DNN ASR, but even then this probability is strongly influenced by the training set, and there are many issues with unseen sequences; see the topics of teacher forcing and scheduled sampling.