openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/
MIT License

Handling `<blank>` token in CTC-based model for decoding #209

Closed EomSooHwan closed 12 months ago

EomSooHwan commented 1 year ago

❓ Questions & Help

I am wondering how exactly the model handles `<blank>` tokens in the predicted indices before they are used as input for WER and CER.

Details

I am currently training a self-implemented, CTC-loss-based Conformer encoder-only model on the LibriSpeech train-clean-100 dataset. However, my model's CER and WER do not go below 1.0 even after 10 epochs. I found that sp.model itself returns a sentence containing "<blank>" when the input to sp.DecodeIds() contains blank_id, and I am wondering if this is the reason for the issue. I looked through OpenspeechCTCModel, ErrorRate, and LibriSpeechSubwordTokenizer, but I could not find the part where the model collapses `<blank>` tokens before or during decoding. Could you please check whether the model has such a step?
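
For reference, this is a minimal sketch of what I mean by blank handling in greedy CTC decoding: take the argmax per frame, collapse consecutive repeats, and drop the blank index before calling `sp.DecodeIds()`. The `blank_id=4` value and the `sp` SentencePiece processor are assumptions from my setup, not necessarily how openspeech handles it internally.

```python
import torch


def ctc_greedy_decode(log_probs, blank_id):
    """Greedy CTC decode: argmax per frame, collapse repeats, drop blanks.

    log_probs: (T, vocab_size) frame-level log-probabilities.
    """
    indices = log_probs.argmax(dim=-1).tolist()  # best symbol per frame
    decoded = []
    prev = None
    for idx in indices:
        # skip consecutive duplicates first, then skip the blank symbol
        if idx != prev and idx != blank_id:
            decoded.append(idx)
        prev = idx
    return decoded


# hypothetical usage with a SentencePiece processor `sp`:
# hypothesis = sp.DecodeIds(ctc_greedy_decode(log_probs, blank_id=4))
```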

EomSooHwan commented 1 year ago

I found that, for some reason, the model predictions are all 0. I am not sure why, but I think that is why the WER and CER look like that.

EomSooHwan commented 1 year ago

I tried decreasing the vocab size of the subword tokenizer, but the result was still the same: WER and CER converge to 1. I also tried a character tokenizer, and this time the CER went below 1; however, it soon converged to 0.85, and the WER just kept increasing throughout training.

EomSooHwan commented 1 year ago

I am using torch 1.10.2+cu102, and the documentation says that "In order to use CuDNN, the following must be satisfied: targets must be in concatenated format, all input_lengths must be T, blank=0, target_lengths ≤ 256, the integer arguments must be of dtype torch.int32." However, the blank index in this LibriSpeech subword tokenizer is 4, so I suspect this as a reason. https://pytorch.org/docs/1.10/generated/torch.nn.CTCLoss.html?highlight=ctc#torch.nn.CTCLoss
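
For illustration, here is a minimal sketch (my own, not openspeech's actual training code) of configuring the CTC loss so its blank index matches the tokenizer's. My understanding is that with blank=4 the cuDNN fast path is simply not used and PyTorch falls back to its native implementation, which is slower but should still be numerically correct; the shapes below are illustrative.

```python
import torch
import torch.nn as nn

# blank index 4 assumed here to match the LibriSpeech subword tokenizer above
criterion = nn.CTCLoss(blank=4, zero_infinity=True)

T, N, C = 100, 8, 5000  # frames, batch size, vocab size (illustrative)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)          # (T, N, C)
targets = torch.randint(low=5, high=C, size=(N, 20))          # no blank in targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

loss = criterion(log_probs, targets, input_lengths, target_lengths)
```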

EomSooHwan commented 1 year ago

I think consecutive symbols and blank symbols are not being removed during the train_cer, train_wer, valid_cer, and valid_wer metric calculations. Also, several papers do note that predictions collapsing to <blank> during the initial stage is natural in CTC-based training (though I still find it odd that it persists for quite a long time).
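
As a rough diagnostic (my own sketch, not part of the library), I am tracking the fraction of frames whose argmax is the blank index; a value stuck near 1.0 over many epochs matches the predict-only-blank behaviour I described above. The blank index 4 is again an assumption from my tokenizer setup.

```python
import torch


def blank_fraction(log_probs, blank_id=4):
    """Fraction of frames predicted as blank.

    log_probs: (batch, T, vocab_size) frame-level log-probabilities.
    """
    preds = log_probs.argmax(dim=-1)
    return (preds == blank_id).float().mean().item()
```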

EomSooHwan commented 12 months ago

I still have not entirely resolved the predict-only-blank phenomenon, but other than that, I think I have solved the handling of consecutive symbols and blank symbols during decoding.