Hi all,

I want to discuss an issue I ran into while training a DNN/CNN-CTC model for speech recognition (Wall Street Journal corpus). The output units are characters.
I observed that the CTC objective function increased and eventually converged during training.
However, the final network outputs show a clear tendency: p(blank) >> p(non-blank) for every speech frame, as shown in the first figure.
In Alex Graves' paper, the trained RNN instead produces sharp peaks of high p(non-blank) at certain frames, as shown in the second figure.
Do you see the same behavior when training an NN-CTC model on a sequence labeling problem? I suspect the reason is that I use an MLP/CNN instead of an RNN, but I can't clearly explain why that would cause this.
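To make the observation concrete, here is a minimal sketch of how one might quantify the blank dominance; it assumes PyTorch-style per-frame log-probabilities with the blank label at index 0, which is an assumption rather than the original setup.

import torch

# Hypothetical diagnostic (not the original code): given per-frame
# log-probabilities of shape (T, C) with the CTC blank at index 0, measure
# how often blank wins the argmax and how large its average probability is.
def blank_dominance(log_probs: torch.Tensor, blank_idx: int = 0):
    probs = log_probs.exp()                                   # back to probabilities
    frac_blank_frames = (probs.argmax(dim=-1) == blank_idx).float().mean().item()
    mean_blank_prob = probs[:, blank_idx].mean().item()
    return frac_blank_frames, mean_blank_prob

# Toy example with random outputs, just to show the call.
T, C = 200, 32                                                # frames, characters + blank
dummy = torch.log_softmax(torch.randn(T, C), dim=-1)
print(blank_dominance(dummy))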
Any ideas about this result? Thank you for reading my question.
CTC training makes NNs first learn to predict only blanks; it can take some time before meaningful non-blank predictions appear. Adaptive learning-rate methods like RMSProp work very well to get past this phase.
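A minimal PyTorch sketch of such a setup (the model, feature size, and label count here are illustrative assumptions, not the original configuration):

import torch
import torch.nn as nn

# Sketch only: one CTC training step with RMSprop.
# nn.CTCLoss expects log_probs of shape (T, N, C), blank at index 0 here.
model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 32))  # toy frame classifier
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)

def train_step(feats, targets, input_lengths, target_lengths):
    # feats: (T, N, 40) acoustic features; targets: concatenated label indices
    log_probs = model(feats).log_softmax(dim=-1)          # (T, N, C)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()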
Maybe training for one epoch with HMM Viterbi alignments (framewise targets) before switching to CTC would help; starting from scratch, it can be hard to learn to align and transcribe at the same time.
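A rough sketch of that warm-up stage, assuming framewise Viterbi alignments are available as per-frame label indices (all names and shapes are hypothetical):

import torch
import torch.nn as nn

# Hypothetical warm-up: framewise cross-entropy on HMM/Viterbi alignments
# for roughly one epoch, before switching to a CTC training step.
model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 32))
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
framewise_ce = nn.CrossEntropyLoss()

def warmup_step(frames, aligned_labels):
    # frames: (B, 40) acoustic frames; aligned_labels: (B,) Viterbi label per frame
    loss = framewise_ce(model(frames), aligned_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After the warm-up epoch, continue training the same model with the CTC loss.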