Hi, I have the same problem! I used the trained model to predict on the training data, and the predictions are completely incorrect.
@thu-spmi @aky15 Is there a problem with the model? Thank you!
Sorry for the late reply. As we stated in the paper, training can be unstable in the initial stage, and it usually helps to increase the weight of the CTC loss (lamb in train.py) to aid convergence. In my experience, a CTC weight of 0.3 should ensure convergence for LibriSpeech even if the batch size is small (e.g., 64 or 32), though the result may be slightly worse. You can lower the CTC weight manually after the first few training steps for better results; I am also preparing an automatic way to gradually reduce the CTC weight as training progresses.
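For concreteness, here is a minimal sketch of what such a schedule could look like, assuming lamb is just a Python float recomputed before each training step; the function name decay_lamb, the constants, and the loss call in the comment are illustrative and not part of the repo:

def decay_lamb(step, lamb_init=0.3, lamb_final=0.01, decay_steps=10000):
    # Linearly anneal the CTC weight from lamb_init to lamb_final over
    # the first decay_steps training steps, then hold it fixed.
    if step >= decay_steps:
        return lamb_final
    frac = step / decay_steps
    return lamb_init + frac * (lamb_final - lamb_init)

# Hypothetical usage inside the training loop:
# for step in range(total_steps):
#     lamb = decay_lamb(step)
#     loss = ctc_crf_loss(netout, labels, input_lengths, label_lengths, lamb)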
Thank you! My loss has dropped (VGGBLSTM, LibriSpeech 1000h), but I cannot reach 4.09% on the test-clean set:
decode_dev_clean/lattice %WER 6.71 [ 3650 / 54402, 273 ins, 843 del, 2534 sub ]
decode_dev_other/lattice %WER 15.17 [ 7730 / 50948, 615 ins, 1525 del, 5590 sub ]
decode_test_clean/lattice %WER 6.69 [ 3515 / 52576, 270 ins, 832 del, 2413 sub ]
decode_test_other/lattice %WER 15.79 [ 8264 / 52343, 592 ins, 1784 del, 5888 sub ]
The training loss is about 20; the cv loss is:
mean_cv_loss: 17.023623735476765 mean_cv_partial_loss-84.86912908309546
mean_cv_loss: 16.849719316531452 mean_cv_partial_loss-84.9270972227439
mean_cv_loss: 16.71052746895032 mean_cv_partial_loss-84.97349450527093
mean_cv_loss: 16.590796837439903 mean_cv_partial_loss-85.0134047157744
mean_cv_loss: 16.582249959309895 mean_cv_partial_loss-85.01625367515108
mean_cv_loss: 16.675133142715847 mean_cv_partial_loss-84.98529261401576
Params: lr: 0.00001, lamb: 0.01. Is there any way to achieve the accuracy reported in the paper? Looking forward to your reply.
Hi @FzuGsr, we updated the LibriSpeech script, removing the pruning of trigram contexts for the denominator graph (line 85), and the training has become more stable. You can try the following configuration to get a desirable result: model: BLSTM, hdim: 512, lr: 0.001, lamb: 0.01, batch_size: 128. Note that we use 4-gram language model rescoring (as detailed in the script) to get the result in the paper.
Thank you for your reply! I will try it later. Does this affect the loss calculation and lead to unstable training?
Yes, the denominator graph will affect the loss calculation.
Hi @aky15, I use a different model (e.g., a Transformer), but the loss looks strange:
time: 8393485.422398832, partial_loss -318.4996032714844,tr_real_loss: -87.59403991699219, lr: 0.01
training epoch: 1, step: 1
time: 0.8945837002247572, partial_loss -273.60565185546875,tr_real_loss: -33.11537170410156, lr: 0.01
training epoch: 1, step: 2
time: 0.8515760116279125, partial_loss -310.4080810546875,tr_real_loss: -65.38531494140625, lr: 0.01
training epoch: 1, step: 3
time: 0.845015412196517, partial_loss -242.25860595703125,tr_real_loss: -3.054351806640625, lr: 0.01
training epoch: 1, step: 4
time: 0.8450672645121813, partial_loss -233.09873962402344,tr_real_loss: 5.82489013671875, lr: 0.01
The loss is negative and keeps getting smaller. Is costs_alpha_den << (1+lamb)*costs_ctc?
Looking forward to your reply.
CTC_CRF_LOSS input:
>>>labels.size()
torch.Size([201])
>>>input_lengths.size()
torch.Size([2])
>>>input_lengths
tensor([123, 95], dtype=torch.int32)
>>>label_lengths
tensor([109, 92], dtype=torch.int32)
>>>netout.size()
torch.Size([2, 131, 347])
Is that right?
I also found that gpu_ctc has a blank_label argument but gpu_den does not.
The blank index of ctc_crf_base.gpu_ctc defaults to 0, but in the token file the blank index is 1:
<eps> 0
<blk> 1
<NSN> 2
<SPN> 3
AA0 4
AA1 5
Is this a bug? Looking forward to your reply.
In our default models (e.g., BLSTM, LSTM), the length of the neural network output is equal to the number of input frames. If the lengths are changed in your model (e.g., you use down/up-sampling or other techniques that change the feature lengths; in your example the network output length is 131, while the original input has at most 123 frames), you should pass the corresponding lengths of the network output to the CTC_CRF loss.
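As an illustration, a minimal sketch of adjusting the lengths for a model that downsamples time by a fixed factor; the factor of 4 and the helper name downsampled_lengths are assumptions for the example, not the repo's API:

import torch

def downsampled_lengths(input_lengths, factor=4):
    # Map the original frame counts to the lengths the model actually
    # produces after downsampling time by `factor` (ceiling division).
    return (input_lengths + factor - 1) // factor

# With the lengths from the log above:
orig = torch.tensor([123, 95], dtype=torch.int32)
print(downsampled_lengths(orig))  # tensor([31, 24], dtype=torch.int32)
# Pass the lengths of the actual network outputs (after any down/up-sampling),
# not the original frame counts, as input_lengths to the CTC_CRF loss.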
About the blank index, please refer to #11 .
Thank you for your reply. Blank index = 1 makes the phone indices in lexicon_number.txt and token.txt inconsistent. Can I change the blank index to something else in the gpu_ctc function?
Yes, I use downsampling and downsample the input_lengths as well, but I get a loss of -inf:
training epoch: 1, step: 1
netout torch.Size([8, 133, 365]) , labels torch.Size([833]), , input_lengths tensor([133, 131, 128, 116, 113, 113, 83, 77]), label_lengths tensor([121, 124, 105, 100, 149, 103, 70, 61], dtype=torch.int32)
netout torch.Size([8, 133, 365]) , labels torch.Size([869]), , input_lengths tensor([132, 128, 125, 121, 115, 103, 102, 66]), label_lengths tensor([111, 154, 105, 87, 100, 122, 124, 66], dtype=torch.int32)
time: 58.328730165958405, partial_loss -inf,tr_real_loss: -inf, lr: 0.001
Is this the right input for the CTC_CRF loss?
ctc_crf_base.gpu_den returns -inf:
>>>logits.size()
torch.Size([3, 163, 365])
>>>input_lengths
tensor([163, 160, 149])
>>>label_lengths
tensor([7, 6, 9], dtype=torch.int32)
>>>labels.size()
torch.Size([22])
>>>costs_ctc
tensor([-151.3362, 0.0000, -193.9635])
>>>costs_alpha_den
tensor([-3642.3738, -inf, -3498.7480], device='cuda:1')
>>>costs_beta_den
tensor([-3642.3711, -inf, -3498.7468], device='cuda:1')
Looking forward to your reply.
The input_length should be at least 2*label_length - 1 (roughly). To be exact, the input_length should be at least ctc_length(labels), where ctc_length is the following function:
def ctc_length(labels):
    needed_blank_count = 0
    for i in range(1, len(labels)):
        if labels[i] == labels[i-1]:
            needed_blank_count += 1
    return len(labels) + needed_blank_count
For example, if the labels are "A A B B", the input feature should have at least 6 frames. If the input_length is shorter than ctc_length(labels), the input cannot traverse all the labels and the result is wrong: the gpu_ctc loss will be 0 and the gpu_den loss will be -inf.
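As a usage illustration, a small sanity check one could run before calling the loss; the helper name check_lengths is made up for this example, and ctc_length is the function above:

def check_lengths(input_lengths, label_seqs):
    # Return indices of utterances whose (possibly downsampled) input
    # length is too short to traverse their label sequence.
    bad = []
    for i, (T, labels) in enumerate(zip(input_lengths, label_seqs)):
        if T < ctc_length(labels):
            bad.append(i)
    return bad

# "A A B B" needs ctc_length == 6 frames, so 5 frames is too short:
print(ctc_length(["A", "A", "B", "B"]))            # 6
print(check_lengths([5], [["A", "A", "B", "B"]]))  # [0]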
You can change the blank_index of gpu_ctc, but you can't change the blank index of gpu_den. If you only use gpu_ctc, you can choose your own blank index and let the neural network output at that index be the probability of the blank label. If you want to use gpu_ctc and gpu_den together, the blank index can't be changed.
Thank you very much for your reply! Does gpu_ctc give the same result as torch.nn.functional.ctc_loss? Does gpu_ctc use CUDA to speed up the CTC computation?
Looking forward to your reply.
gpu_ctc is modified from Baidu's warp-ctc (https://github.com/baidu-research/warp-ctc). We changed warp-ctc's input from logits (without log_softmax) to log_softmax. torch.nn.functional.ctc_loss is PyTorch's implementation of CTC; it may use cuDNN's CTC implementation internally (its input is also log_softmax). In any case, warp-ctc and PyTorch's CTC are different implementations of the same loss, so they should give similar results. Both support computation on CUDA. I have not compared their speed, and I think the CTC computation accounts for only a small proportion of the total computation (including the neural network forward/backward).
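For reference, a minimal sketch of calling PyTorch's built-in CTC loss on shapes like those in the logs above, which could serve as a comparison point; the random tensors are purely illustrative, and the corresponding gpu_ctc call is omitted here since its exact signature is not shown in this thread:

import torch
import torch.nn.functional as F

# Fake network output: T=163 frames, N=3 utterances, C=365 output units.
T, N, C = 163, 3, 365
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # F.ctc_loss expects (T, N, C)
# A batch-major netout of shape (N, T, C) would need .transpose(0, 1) first.
input_lengths = torch.tensor([163, 160, 149])
target_lengths = torch.tensor([7, 6, 9])
targets = torch.randint(1, C, (int(target_lengths.sum()),))  # flat labels, no blanks

# blank defaults to index 0; reduction='none' gives one cost per utterance.
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction='none')
print(loss)  # three per-utterance CTC losses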
Thank you very much for your reply! You are so excellent.
Hi, I ran the libr/run.sh demo, but the loss is still very large and the model can't converge. Can you help me? Is it possible to release model configs or trained models? My environment: PyTorch 1.5, CUDA 10.1, Python 3.7. I run:
python3 steps/train.py --lr=0.001 --output_unit=72 --lamb=0.001 --data_path=$dir --batch_size=256
Looking forward to your reply. Thank you!