fujimotos opened 1 year ago
I recently ran 31 epochs of training on reazonspeech medium (~500h) using the ESPnet2 LibriSpeech recipe. The log is below (it does not seem to have saturated yet):
```
2023-02-16 07:43:47,882 (trainer:338) INFO: 31epoch results:
[train] iter_time=2.915e-04, forward_time=0.100, loss_ctc=40.290, loss_att=30.836, acc=0.690, loss=33.672, backward_time=0.117, optim_step_time=0.083, optim0_lr0=2.697e-04, train_time=28.888, time=1 hour, 25 minutes and 30.11 seconds, total_count=550157, gpu_max_cached_mem_GB=4.861,
[valid] loss_ctc=21.431, cer_ctc=0.259, loss_att=16.760, acc=0.834, cer=0.222, wer=0.849, loss=18.161, time=44.84 seconds, total_count=3255, gpu_max_cached_mem_GB=4.861,
```
[plots: Loss and CER training curves]
For reference, here is our current conformer-transformer model (the parameter count differs):
```
2023-02-11 03:22:58,191 (trainer:338) INFO: 31epoch results:
[train] iter_time=2.541e-04, forward_time=0.077, loss_ctc=31.263, loss_att=17.444, acc=0.787, loss=21.590, backward_time=0.063, optim_step_time=0.057, optim0_lr0=7.346e-04, train_time=6.864, time=34 minutes and 31.61 seconds, total_count=280519, gpu_max_cached_mem_GB=4.801,
[valid] loss_ctc=22.093, cer_ctc=0.266, loss_att=12.771, acc=0.859, cer=0.194, wer=0.799, loss=15.567, time=12.59 seconds, total_count=1674, gpu_max_cached_mem_GB=4.801, [att_plot] time=1 minute and 6.61 seconds, total_count=0, gpu_max_cached_mem_GB=4.801
```
We have no plans to run this at large scale for now, but if we make progress on the Branchformer experiments, I will post updates here.
@pyf98, maybe you can help them. You can translate this into English (or Chinese).
I think their learning rate is too low in this scenario, or there is something wrong with the effective batch size (with multiple GPUs or gradient accumulation).
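For concreteness, these are the knobs in question in an ESPnet2 training YAML. This is a hypothetical fragment with illustrative values, not the config from the run above; the effective batch roughly scales with `accum_grad` and the number of GPUs, and the peak learning rate usually needs to be adjusted along with it:

```yaml
# Hypothetical ESPnet2 training-config fragment; values are illustrative only.
optim: adam
optim_conf:
    lr: 0.002            # peak lr reached after warmup; optim0_lr0 in the log follows this schedule
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
accum_grad: 4            # gradient accumulation: multiplies the effective batch size
batch_type: numel
batch_bins: 140000000    # per-iteration batch budget; the effective batch also scales with the GPU count
```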
I'm not sure what Conformer and E-Branchformer configs are being used exactly. I feel some configs might have issues.
The Conformer config provided above has 12 layers without Macaron FFN, and its input layer downsamples by a factor of 6. These settings differ from the configs in other recipes (e.g., LibriSpeech). If you simply reuse the E-Branchformer config from LibriSpeech, there can be issues; for example, the model can end up much larger.
In our experiments, we scale Conformer and E-Branchformer to have similar parameter counts. In such cases, we usually do not need to re-tune the training hyperparameters. We have added E-Branchformer configs and results to many other ESPnet2 recipes covering various types of speech.
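As a concrete illustration of matching parameter counts, an E-Branchformer encoder section in ESPnet2 YAML looks roughly like the sketch below, modeled on the LibriSpeech recipe. The exact keys and values here are assumptions; `output_size`, `num_blocks`, and the branch widths are the main levers for matching a Conformer baseline:

```yaml
# Hypothetical E-Branchformer encoder config, modeled on the LibriSpeech recipe.
encoder: e_branchformer
encoder_conf:
    output_size: 256          # shrink/grow this and num_blocks to match the Conformer's size
    attention_heads: 4
    attention_layer_type: rel_selfattn
    pos_enc_layer_type: rel_pos
    rel_pos_type: latest
    cgmlp_linear_units: 1024  # cgMLP branch width
    cgmlp_conv_kernel: 31
    use_linear_after_conv: false
    gate_activation: identity
    num_blocks: 12
    dropout_rate: 0.1
    input_layer: conv2d       # 4x subsampling, unlike the 6x input layer noted above
    linear_units: 1024        # FFN width
    use_ffn: true
    macaron_ffn: true         # the Conformer config above lacks Macaron FFN
    merge_conv_kernel: 31
```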
@pyf98 @sw005320 Thanks for your input! The experiment above was conducted with this config on ~500h of data. The E-Branchformer model has 145M parameters, while the Conformer used for comparison has 91M. (By the way, our latest released Conformer model enables Macaron FFN.)
Will also check the lr/accum_grad/multi-GPU/downsampling configurations and other recipes when we run more experiments on larger datasets!
Thanks for the information. When comparing these models (E-Branchformer vs Conformer), we typically just replaced the encoder config (at a similar model size) but kept the other training configs the same. This worked well in general.
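In other words, the diff between the two experiments can stay minimal. A hypothetical sketch of such a swap (everything outside the `encoder`/`encoder_conf` section is left exactly as in the baseline):

```yaml
# Baseline (unchanged elsewhere):
#   encoder: conformer
#   encoder_conf: { ... }
# Comparison run: only the encoder section is swapped.
encoder: e_branchformer
encoder_conf:
    output_size: 256   # chosen so the total parameter count roughly matches the Conformer
    num_blocks: 12
    # remaining keys as in the sketch above; optimizer, scheduler, batching,
    # and decoder configs are untouched
```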
Goal of this ticket
Reference links