reazon-research / ReazonSpeech

Massive open Japanese speech corpus
https://research.reazon.jp/projects/ReazonSpeech/
Apache License 2.0

Validation of the E-Branchformer model #8

Open fujimotos opened 1 year ago

fujimotos commented 1 year ago

Goal of this ticket

Reference links

euyniy commented 1 year ago

I recently ran 31 epochs of training on reazonspeech medium (~500h) using the ESPnet2 LibriSpeech recipe. The log is below (it does not appear to have saturated yet):

2023-02-16 07:43:47,882 (trainer:338) INFO: 31epoch results: 
[train] iter_time=2.915e-04, forward_time=0.100, loss_ctc=40.290, loss_att=30.836, acc=0.690, loss=33.672, 
backward_time=0.117, optim_step_time=0.083, optim0_lr0=2.697e-04, train_time=28.888, time=1 hour, 25 minutes and 
30.11 seconds, total_count=550157, gpu_max_cached_mem_GB=4.861, 
[valid] loss_ctc=21.431, cer_ctc=0.259, loss_att=16.760, acc=0.834, cer=0.222, wer=0.849, loss=18.161, time=44.84 
seconds, total_count=3255, gpu_max_cached_mem_GB=4.861, 
[Plots: training loss and CER curves]

For reference, here is how the current conformer-transformer model looks (its parameter count is different):

2023-02-11 03:22:58,191 (trainer:338) INFO: 31epoch results: 
[train] iter_time=2.541e-04, forward_time=0.077, loss_ctc=31.263, loss_att=17.444, acc=0.787, loss=21.590, 
backward_time=0.063, optim_step_time=0.057, optim0_lr0=7.346e-04, train_time=6.864, time=34 minutes and 31.61 
seconds, total_count=280519, gpu_max_cached_mem_GB=4.801, 
[valid] loss_ctc=22.093, cer_ctc=0.266, loss_att=12.771, acc=0.859, cer=0.194, wer=0.799, loss=15.567, time=12.59 
seconds, total_count=1674, gpu_max_cached_mem_GB=4.801, [att_plot] time=1 minute and 6.61 seconds, total_count=0, 
gpu_max_cached_mem_GB=4.801

We have no plans to run this at large scale for now, but I will post updates here if there is any progress on the Branchformer experiments.

sw005320 commented 1 year ago

@pyf98, maybe you can help them. You can translate this into English (or Chinese).

I think their learning rate is too low in this scenario, or there is something wrong with the actual batch size (with multiple GPUs or gradient accumulation).
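(For reference, these knobs live in the ESPnet2 training YAML. The sketch below uses illustrative LibriSpeech-style values, not the settings from this run; the effective batch size is roughly batch_bins × accum_grad × ngpu, where `--ngpu` is passed to the training script, so changing any of them usually means re-tuning the peak lr and warmup_steps.)

```yaml
# Illustrative training hyper-parameters (assumed values, not this run's config)
batch_type: numel
batch_bins: 140000000      # per-GPU batch size, counted in feature elements
accum_grad: 4              # gradient accumulation steps
optim: adam
optim_conf:
    lr: 0.002              # peak learning rate reached after warmup
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 40000    # warmup length also interacts with the effective batch size
```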

pyf98 commented 1 year ago

I'm not sure what Conformer and E-Branchformer configs are being used exactly. I feel some configs might have issues.

The Conformer config provided above has 12 layers without Macaron FFN, and the input layer downsamples by a factor of 6. These differ from the configs in other recipes (e.g., LibriSpeech), so if you simply reuse the LibriSpeech E-Branchformer config there can be issues; for example, the model can end up much larger.

In our experiments, we scale Conformer and E-Branchformer to have similar parameter counts. In such cases, we usually do not need to tune the training hyper-parameters again. We have added E-Branchformer configs and results in many other ESPnet2 recipes covering various types of speech.
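To make that concrete, a scaled-down E-Branchformer encoder section in an ESPnet2 recipe might look like the sketch below. The key names follow the LibriSpeech-style configs, but the specific sizes are assumptions chosen only to show which fields control the parameter count, not a validated setting.

```yaml
# Sketch of a reduced E-Branchformer encoder section (sizes are assumptions)
encoder: e_branchformer
encoder_conf:
    output_size: 256           # attention dimension; the main width knob
    attention_heads: 4
    num_blocks: 12             # depth; fewer blocks shrink the model
    linear_units: 1024         # FFN hidden size
    cgmlp_linear_units: 1024   # cgMLP branch hidden size
    cgmlp_conv_kernel: 31
    input_layer: conv2d        # 4x subsampling (conv2d6 would give 6x)
    dropout_rate: 0.1
```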

euyniy commented 1 year ago

@pyf98 @sw005320 Thanks for your input! The experiment above was conducted with this config on ~500h of data. The E-Branchformer model has 145M params, while the Conformer used for comparison has 91M. (By the way, in our latest released Conformer model we enabled Macaron FFN.)

We will check the lr / accum_grad / multi-GPU / downsampling configurations, as well as the other recipes, when we run more experiments on a larger dataset! For reference, the encoder-side sketch below shows roughly where those two points (Macaron FFN and the input-layer downsampling factor) appear in a Conformer config.
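The values here are placeholders for illustration, not our released model's actual settings:

```yaml
# Hypothetical Conformer encoder section (placeholder values)
encoder: conformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 2048
    num_blocks: 12
    input_layer: conv2d6         # 6x subsampling, as discussed above
    macaron_style: true          # Macaron FFN on both sides of attention
    pos_enc_layer_type: rel_pos
    selfattention_layer_type: rel_selfattn
    use_cnn_module: true
    cnn_module_kernel: 31
    dropout_rate: 0.1
```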

pyf98 commented 1 year ago

Thanks for the information. When comparing these models (E-Branchformer vs Conformer), we typically just replaced the encoder config (at a similar model size) but kept the other training configs the same. This worked well in general.