gheyret / uyghur-asr-transformer

Speech recognition for Uyghur using a Speech Transformer

W2Llayer #1

Open · kelvinqin opened this issue 2 years ago

kelvinqin commented 2 years ago

Dear Gheyret, Thanks for your work.

I spent some time today trying to figure out the source of this feature-extraction layer. Can you point me to the paper or any other reference for it?

I think it is a great design for extracting speech features, so I just want to understand it more deeply.

Thanks a lot,

Kelvin

gheyret commented 2 years ago

Hi @kelvinqin Thanks for your comment.

W2Llayer is more or less a combination of the DeepSpeech2 and Wav2Letter models. Unlike DeepSpeech2, I added one RNN layer after the VGG extraction block, and one RNN block between the layers of the Wav2Letter model.
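Roughly, the layer has the shape of the sketch below. This is only an illustration of the idea: the kernel sizes, channel counts, hidden sizes, and class name here are assumptions, not the exact values from this repository.

```python
import torch
import torch.nn as nn

class W2LLayerSketch(nn.Module):
    """Illustrative only: a VGG-style conv front end (as in DeepSpeech2),
    an RNN layer right after it, then Wav2Letter-style 1-D convolutions
    with an RNN block inserted between them. All hyperparameters are
    guesses, not taken from this repository."""

    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # VGG extraction block: two convs downsampling time and frequency by 4x
        self.vgg = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        feat_dim = 32 * (n_mels // 4)
        # RNN layer placed right after the VGG block (the DeepSpeech2-like part)
        self.rnn1 = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        # First half of the Wav2Letter-style 1-D convolution stack
        self.conv1 = nn.Sequential(
            nn.Conv1d(2 * hidden, hidden, kernel_size=11, padding=5), nn.ReLU(),
        )
        # RNN block inserted between the Wav2Letter layers
        self.rnn2 = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Second half of the convolution stack
        self.conv2 = nn.Sequential(
            nn.Conv1d(2 * hidden, hidden, kernel_size=11, padding=5), nn.ReLU(),
        )

    def forward(self, x):                      # x: (batch, time, n_mels)
        x = self.vgg(x.unsqueeze(1))           # (batch, 32, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn1(x)                    # (batch, time/4, 2*hidden)
        x = self.conv1(x.transpose(1, 2))      # (batch, hidden, time/4)
        x, _ = self.rnn2(x.transpose(1, 2))    # (batch, time/4, 2*hidden)
        x = self.conv2(x.transpose(1, 2))      # (batch, hidden, time/4)
        return x.transpose(1, 2)               # (batch, time/4, hidden)
```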

kelvinqin commented 2 years ago

Dear Gheyret, thanks so much for your reply. Now I understand why I could not find its "source": it turns out that you are the source :-) Great work, and your code is much clearer than other "deeply wrapped" frameworks.

So W2Llayer means "Wav2Letter layer", is that correct?

BTW, do you have plans to build a Conformer system?

A small question: in training, why don't you use the encoder's output to calculate the CTC loss? Instead, you use the W2Llayer's output (see train.py, line 130). Wouldn't it be better to calculate it from "encoder_padded_outputs"? A sketch of the two options follows below.
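To make the question concrete, here is a minimal sketch of the two options as I understand them. All variable names here are my assumptions for illustration, not the actual names in train.py (except encoder_padded_outputs, quoted above):

```python
import torch
import torch.nn.functional as F

def ctc_on(branch_out, targets, input_lengths, target_lengths):
    # branch_out: (batch, time, vocab) logits from either branch
    log_probs = F.log_softmax(branch_out, dim=-1).transpose(0, 1)  # (time, batch, vocab)
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

# As I read train.py, the current choice is roughly:
#   loss = ctc_on(w2l_out, ...)        # W2Llayer branch (hypothetical name)
# versus what I am asking about:
#   loss = ctc_on(enc_proj_out, ...)   # projection of encoder_padded_outputs (hypothetical name)
```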

Best regards, Kelvin