Closed by jcgeo9 2 years ago
The output length is the length of the speech signal in frames. (The conv module reduces it to 1/4 of the input length.)
The target length is the number of characters uttered in that speech signal. These are two different things.
Taking CTC as an example: the audio is very long, but the "hello" spoken in it is only five letters.
This mismatch can be handled with a sequence-tagging method like CTC, or a separate decoder can be attached to predict the tokens one by one.
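To make the length mismatch concrete, here is a minimal sketch of the CTC collapsing rule (merge consecutive repeats, then drop blanks). The frame string and the `_` blank symbol are made up for illustration; a real model would emit one vocabulary index per frame instead.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen only for this illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence the way CTC does:
    merge consecutive repeats first, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# One label per encoder output frame (made-up example).
frames = "__hh_e_ll_lll_oo__"
decoded = ctc_collapse(frames)

# The number of frames (output length) is much larger than
# the number of characters (target length).
print(len(frames), "frames ->", repr(decoded), f"({len(decoded)} characters)")
```

Note the blank between the two groups of `l`: without it the repeated letter would be merged into one, which is why CTC needs the output length to be at least as long as the target.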
Try converting the actual audio signal into a spectrogram and feeding that in. Is the loss negative?
@sooftware So I use either the CTC loss or a single-layer LSTM decoder (as proposed in the paper)? What if I want to check the accuracy, e.g. compare the predicted word sequence against the target?
I am converting the audio files to MelSpec before adding them to my dataloaders.
CTC Loss was not used in the Conformer paper.
If you have a separate decoder, you can use Cross Entropy Loss instead of CTC Loss.
If you want to check the accuracy, it would help to study how speech recognition models are trained.
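As a rough sketch of how accuracy is commonly checked in speech recognition: compare the predicted sequence with the target using edit distance, normalized by the target length. This gives a character error rate (CER) or word error rate (WER). The function names and example strings below are illustrative assumptions, not part of this repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference tokens to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Made-up example pair.
target = "hello world"
predicted = "helo word"

cer = error_rate(list(target), list(predicted))     # character level
wer = error_rate(target.split(), predicted.split())  # word level
print(f"CER = {cer:.3f}, WER = {wer:.3f}")
```

Lower is better for both metrics, and either one can exceed 1.0 when the prediction contains many insertions.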
@sooftware In the paper they are using a single-layer LSTM decoder.
So I need to construct one and feed it the outputs of the encoder to produce results?
Yes, but as far as I remember, the Conformer paper used an LSTM transducer decoder.
Yes. Do you have any source where I can find out more about it?
@sooftware Can you kindly explain to me why the output lengths and the target lengths are so different? :/ (Also, I get negative floats in the outputs.) Example shown below.
The outputs have shape [32, 490, 16121] (where 16121 is the size of my vocab). What does the 490 dimension represent? Also, the outputs are probabilities, right?
I am using the following code for training and evaluation