sooftware / conformer

[Unofficial] PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)
Apache License 2.0

Problem with training a Conformer+RNN-T model #38

Open scufan1990 opened 2 years ago

scufan1990 commented 2 years ago

Hi, I have a problem training a Conformer+RNN-T model. What CER and WER should I expect with one GPU?

I'm training the model on a single RTX TITAN GPU with a Conformer (16 encoder layers, encoder dim 144, 1 decoder layer, decoder dim 320). After 50 epochs the CER is about 27 and does not decrease any further.

wszyy commented 2 years ago

Hello, I am running into the same problem, though I use the Conformer encoder with a Transformer decoder. By the way, did you figure out the output of DecoderRNNT? It is a 4-dimensional tensor; how do you use it to recognize speech?

jingzhang0909 commented 2 years ago

Could you tell me what dataset you used for training? How long does it take to train a checkpoint? The paper uses LibriSpeech with 970 hours, which seems like it would take a lot of time to train on.

wszyy commented 2 years ago

Um, I use AISHELL-1, and training takes over 10 hours, but the results are not very good. Actually, I train the model on Google Colab, which really takes a long time. By the way, do you understand the 4-dimensional result? The author just uses torch.cat to combine the encoder_output and decoder_output matrices, so it seems the network cannot be used to recognize speech directly. So I built two networks: 1. Conformer encoder with a Transformer decoder; 2. Conformer encoder with an LSTM decoder with an attention mechanism. I have now been training the two networks for several days.
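For what it's worth, the 4-dimensional tensor (batch, time, target_len, vocab) is the full joint lattice, which is normally consumed only by the transducer loss during training. For recognition you instead step through the encoder frames one at a time and emit symbols until the joint predicts blank. Below is a minimal, hedged sketch of RNN-T greedy decoding; `TinyJoint` and `decoder_step` are hypothetical stand-ins that only mirror the concat-then-project idea, not this repo's actual API:

```python
import torch
import torch.nn as nn

class TinyJoint(nn.Module):
    """Hypothetical joint network: concatenate one encoder frame with the
    prediction-network state, then project to the vocabulary."""
    def __init__(self, enc_dim, dec_dim, vocab):
        super().__init__()
        self.fc = nn.Linear(enc_dim + dec_dim, vocab)

    def forward(self, enc_t, dec_u):
        # enc_t: (1, enc_dim) one encoder frame; dec_u: (1, dec_dim)
        return self.fc(torch.cat([enc_t, dec_u], dim=-1))  # (1, vocab)

def greedy_decode(encoder_out, decoder_step, joint, blank=0, max_symbols=3):
    """encoder_out: (time, enc_dim). Returns a list of emitted token ids."""
    hyp = []
    dec_state = decoder_step(None)           # start-of-sequence state
    for t in range(encoder_out.size(0)):
        for _ in range(max_symbols):         # cap emissions per frame
            logits = joint(encoder_out[t:t + 1], dec_state.unsqueeze(0))
            token = logits.argmax(dim=-1).item()
            if token == blank:               # blank: advance to next frame
                break
            hyp.append(token)                # non-blank: emit and update state
            dec_state = decoder_step(token)
    return hyp

# Toy usage with random, untrained weights, so the output is arbitrary.
torch.manual_seed(0)
vocab, enc_dim, dec_dim, T = 5, 8, 8, 4
joint = TinyJoint(enc_dim, dec_dim, vocab)
emb = nn.Embedding(vocab, dec_dim)

def decoder_step(tok):
    # Stand-in prediction network: an embedding lookup (id 0 doubles as SOS).
    return emb(torch.tensor(0 if tok is None else tok))

enc = torch.randn(T, enc_dim)
hyp = greedy_decode(enc, decoder_step, joint)
```

The point is that decoding never materializes the (batch, time, target_len, vocab) tensor; it walks one path through that lattice, so the 4-D output is not a bug, it just isn't the inference interface.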

jingzhang0909 commented 2 years ago

Thanks for your reply! I have not decided which model and dataset to use yet. I will share with you if I have any further info.

wszyy commented 2 years ago

That would be great; I also need to talk with others to learn more about the network. Are you from China? Maybe we can exchange contact info.

wanglongR commented 1 year ago

Hello wszyy, I am from China. I have been learning about the Conformer model recently and would like to discuss it with you. If you are willing, you can add me on WeChat, ID: scrushy518