kaituoxu / Speech-Transformer

A PyTorch implementation of Speech Transformer, an End-to-End ASR with Transformer network on Mandarin Chinese.

question about loss curve #16

Closed: xingchensong closed this issue 5 years ago

xingchensong commented 5 years ago

Hi kaituo, I'm trying to train this network on LibriSpeech. The loss curve of epoch 1 shows that the model tends to saturate after the first few steps (there are approximately 4k iterations per epoch, and my loss drops from 4 to 3 after about 100 iterations and then stays the same). I have not made any changes to the model; the only change is that I use my own dataloader (for loading the LibriSpeech corpus). So I wonder whether you saw the same loss-decline trend when training on the AISHELL corpus?

xingchensong commented 5 years ago

Here is my loss curve of epoch 1: [plot]

kaituoxu commented 5 years ago

This is very normal. Just keep training and watch how the loss changes epoch by epoch, not iteration by iteration.

If the final model doesn't work well, you may need to try a different "k", which affects the learning rate.

Thx.
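
For context, "k" scales the standard Transformer (Noam) learning-rate schedule. Below is a minimal sketch of such a schedule with a k multiplier; the class name and default values are illustrative, not the repository's exact optimizer code.

```python
import torch

class ScaledNoamOpt:
    """Noam learning-rate schedule scaled by a constant k (illustrative sketch):
    lr = k * d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    A larger k raises the whole curve; warmup_steps shifts where the peak occurs.
    """
    def __init__(self, optimizer, k=1.0, d_model=512, warmup_steps=4000):
        self.optimizer = optimizer
        self.k = k
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def step(self):
        self.step_num += 1
        lr = self.k * self.d_model ** -0.5 * min(
            self.step_num ** -0.5,
            self.step_num * self.warmup_steps ** -1.5)
        for group in self.optimizer.param_groups:
            group['lr'] = lr
        self.optimizer.step()

    def zero_grad(self):
        self.optimizer.zero_grad()

# Illustrative usage: wrap a plain Adam optimizer and call opt.step() once per iteration.
param = torch.nn.Parameter(torch.zeros(1))
base_opt = torch.optim.Adam([param], betas=(0.9, 0.98), eps=1e-9)
opt = ScaledNoamOpt(base_opt, k=1.0, d_model=512, warmup_steps=4000)
```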

xingchensong commented 5 years ago

@kaituoxu [figure: loss curve on the 128-wav subset]

Hi kaituo, I used a small dataset (nearly 128 LibriSpeech wavs) to test whether the model can converge on it. The figure shows that the loss remains unchanged after converging to 3.0 in the early iterations (similar to the picture I posted before). After about 150 epochs the loss starts falling again (but the CV loss begins to increase, which means overfitting).

This loss curve is so weird! I have tried a lot of different hyperparameters (such as k, warmup_steps, n_layers_enc, d_model, etc.); they all appear to converge to 3.0 and stay there. This has been bothering me for a long time, and I don't know what went wrong :(

Besides, my input is the same as yours (except that the fbank features are extracted with librosa, not with Kaldi). My labels contain the 26 lowercase letters plus space_tok, unknown_tok, start_tok and end_tok. I use batch_size to generate one batch of data instead of batch_frames.
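
For illustration only, here is one way the setup described above could look. The window/hop sizes and the exact token spellings are assumptions, not necessarily what was actually used.

```python
import librosa
import numpy as np

def logmel_fbank(wav_path, sr=16000, n_mels=80, n_fft=400, hop_length=160):
    """Log-mel filterbank features via librosa (frame/hop sizes are assumed)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-10).T  # shape: (num_frames, n_mels)

# Character vocabulary as described above: 26 lowercase letters plus special tokens.
specials = ['space_tok', 'unknown_tok', 'start_tok', 'end_tok']
letters = [chr(c) for c in range(ord('a'), ord('z') + 1)]
vocab = {tok: idx for idx, tok in enumerate(specials + letters)}
```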

xingchensong commented 5 years ago

> This is very normal. Just keep training and watch how the loss changes epoch by epoch, not iteration by iteration.
>
> If the final model doesn't work well, you may need to try a different "k", which affects the learning rate.
>
> Thx.

Another question: for a small dataset (128 wavs), it takes only a few iterations to converge to 3.0 and then about 300 epochs before the loss decreases again. Is that normal?

kaituoxu commented 5 years ago

@stephen-song, if you want to overfit 128 wavs, first turn off all regularization, such as L2 (weight decay), dropout, and label smoothing, then train your model again. Besides, try different values of "k"; it is very, very important for the model to converge.
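
As a concrete illustration of "turn off all regularization", here is a hedged sketch of an overfitting sanity check. It uses PyTorch's built-in nn.Transformer and the label_smoothing argument of CrossEntropyLoss (available since PyTorch 1.10), not this repository's model classes; all values are placeholders.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # illustrative padding index

# Overfitting sanity check on ~128 utterances: disable every regularizer so the
# model is free to memorize the training set.
model = nn.Transformer(d_model=256, nhead=4,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=1024,
                       dropout=0.0)                         # no dropout
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID,
                                label_smoothing=0.0)        # no label smoothing
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=0.0)              # no L2 penalty
```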

kaituoxu commented 5 years ago

Hi @stephen-song, how about the result?

xingchensong commented 5 years ago

> Hi @stephen-song, how about the result?

Hi kaituo, the model can finally overfit the 128 LibriSpeech wavs; k and batch_size (or batch_frames) are truly important to make it work, just as you mentioned. Fine-tuning those hyperparameters on the whole LibriSpeech dataset is not worth the time for me, so I now use AISHELL instead and focus on modifying the model (such as adding the 2D-Attention mentioned in paper [1]).

kaituoxu commented 5 years ago

Okay, thanks for your response :)

martin-radfar commented 4 years ago

@xingchensong
I have a similar problem. Can you share the values you used for the parameters? A snapshot of run.sh would be great.