hirofumi0810 / neural_sp

End-to-end ASR/LM implementation with PyTorch
Apache License 2.0

Can't replicate CSJ's LAS results with the default setup. #228

lijianhackthon opened this issue 3 years ago

lijianhackthon commented 3 years ago

Hi @hirofumi0810 ,

I'm trying to replicate the results of the LAS model that you shared in this table.

I'm using the default run.sh script in the csj example. So far, I've got two sets of results, with and without using the external LM (external_lm). The external_lm is also trained by this run.sh script in stage 3. I didn't change the options defined in conf/asr/blstm_las.yaml or conf/lm/rnnlm.yaml, and I used one GPU for training.
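For reference, the relevant variables in the csj run.sh look roughly like this when left at their defaults (a sketch of my understanding, not the exact file contents):

```bash
# Variables in the csj example's run.sh that matter here (sketch; exact defaults may differ):
conf=conf/asr/blstm_las.yaml   # ASR model / training config
lm_conf=conf/lm/rnnlm.yaml     # config for the external RNNLM trained in stage 3
conf2=                         # extra data config (e.g. SpecAugment); empty by default
asr_init=                      # optional seed ASR model for initialization
external_lm=                   # external LM used for fusion during ASR training
```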

Here are the results I've got, using the default local/score.sh script for testing; again, I didn't change any hyper-parameters defined in that script.

| Test | %WER | %CER |
|-----------|-----|-----|
| csj_eval1 | 7.7 | 6.0 |
| csj_eval2 | 5.9 | 4.8 |
| csj_eval3 | 6.4 | 4.9 |

| Test | %WER | %CER |
|-----------|-----|-----|
| csj_eval1 | 7.7 | 6.0 |
| csj_eval2 | 5.7 | 4.7 |
| csj_eval3 | 6.4 | 5.0 |

For easier reference, I'm posting your released results here:

| Model | eval1 | eval2 | eval3 |
|-------|-------|-------|-------|
| LAS   | 6.5   | 5.1   | 5.6   |

Since your released results are WERs, my results are a little worse than yours on all the eval sets, whether or not I use the external LM. Also, using the external LM doesn't help much.

For your reference, here is my environment setup:

OS: Ubuntu 18.04.5 LTS
CUDA: 10.2
python: 3.7.9
pytorch: 1.6.0
warpctc-pytorch: 0.2.1

The commit hash of the neural_sp that I'm using is 7a9ec231d50a9ed6a0457a6c582526173c8ceb6b.

I'd like to know whether the results you released were obtained with the default run.sh script. Did I miss something important?

Thanks a lot for your help.

hirofumi0810 commented 3 years ago

@lijianhackthon I think you missed SpecAugment.

lijianhackthon commented 3 years ago

@hirofumi0810 Thanks a lot for your quick reply.

Based on your suggestion, I added SpecAugment to the setup and ran the script again. Here are the final test results:

| Test | %WER | %CER |
|-----------|-----|-----|
| csj_eval1 | 7.3 | 6.0 |
| csj_eval2 | 5.3 | 4.2 |
| csj_eval3 | 5.7 | 4.4 |

It indeed improves the accuracy compared to my old setups, which is a good sign. However, there is still a gap between my numbers and your benchmark.

My ASR configuration is now:

conf=conf/asr/blstm_las.yaml
conf2=conf/data/spec_augment_pt.yaml
asr_init=exp/csj_mdl/asr/train_nodev_all_wpbpe10000/conv2Lblstm512H5L_sumfwdbwd_chunkL-1R40_drop4_lstm1024H1L_location_ss0.2_adam_lr0.001_bs30_ls0.1_ctc0.3_3/model.epoch-25
external_lm="exp/csj_mdl/lm/train_nodev_all_vocaball_wpbpe10000/lstm1024H0P4L_emb1024_tie_residual_glu_adam_lr0.001_bs64_bptt200_dropI0.2H0.5_ls0.1/model.epoch-27"

I used conf2=conf/data/spec_augment_pt.yaml to bring in SpecAugment and asr_init= to specify the base model for adaptation. The external_lm is still the RNNLM trained in stage 3 with lm_conf=conf/lm/rnnlm.yaml. The test script is still the default local/score.sh, with model= and lm= assigned accordingly.
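For completeness, decoding was configured by setting the model and lm variables in local/score.sh, roughly like this (a sketch; the long experiment-directory names above are abbreviated, and the epoch of the new model is a placeholder):

```bash
# Variables set in local/score.sh for decoding (sketch; paths abbreviated / hypothetical):
model=exp/csj_mdl/asr/<new_specaug_experiment_dir>/model.epoch-XX          # the LAS model trained with SpecAugment
lm=exp/csj_mdl/lm/train_nodev_all_vocaball_wpbpe10000/.../model.epoch-27   # the stage-3 RNNLM, for fusion at decoding
# Leaving lm empty (lm=) decodes without LM fusion.
```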

Did I use SpecAugment correctly? Do you think there is still some feature I haven't used? Thanks.

hirofumi0810 commented 3 years ago

@lijianhackthon Great. Can you decode again w/o LM fusion?

lijianhackthon commented 3 years ago

Hi @hirofumi0810 , thanks for the hint. Here are the results of decoding w/o LM fusion. To disable LM fusion for decoding, I just left the lm variable in local/score.sh empty; if that's not the right way to do it, please let me know.

| Setup | %WER for csj_eval1 | %WER for csj_eval2 | %WER for csj_eval3 |
|-------|--------------------|--------------------|--------------------|
| author's benchmark | 6.5 | 5.1 | 5.6 |
| blstm_las + decode with lm fusion | 7.3 | 5.3 | 5.7 |
| blstm_las + decode without lm fusion | 7.0 | 5.2 | 5.6 |

Without LM fusion for decoding, the results indeed got better. The WER for eval3 is the same as the benchmark and the WER for eval2 is close enough, but there is still an obvious gap on eval1.

Just to be clear, as I mentioned in the post above, I used the external LM during the training phase of the blstm_las model. The external LM was trained in stage 3 of run.sh using the default conf/lm/rnnlm.yaml.

Could you please explain a bit why we get better results without LM fusion for decoding? Also, since there is still a gap on eval1 and eval2, what do you think would be the next thing to try? Thanks again for your help.

hirofumi0810 commented 3 years ago

The LM on CSJ is trained with transcriptions only. In such a case, E2E models trained with SpecAugment do not obtain any gains from it. If we could use additional text for training the LM, it might be helpful. So you trained LAS with cold fusion, right? I think that could lead to a condition mismatch between training and testing. You can reproduce the result if you remove LM fusion during training.
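For context, decoding with LM fusion here presumably combines scores in the standard shallow-fusion way (the generic formulation below, not necessarily the exact scoring implemented in this repo):

$$\hat{y} = \arg\max_{y} \left[ \log P_{\mathrm{ASR}}(y \mid x) + \lambda \, \log P_{\mathrm{LM}}(y) \right]$$

When the LM is trained only on the same transcriptions the ASR model already sees, the second term adds little new information, which is presumably why it brings no gain on top of SpecAugment here.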

lijianhackthon commented 3 years ago

Hi @hirofumi0810 ,

I got the test results of the LAS model trained without LM cold fusion. Please see the last two rows of the table below:

| Setup | %WER for csj_eval1 | %WER for csj_eval2 | %WER for csj_eval3 |
|-------|--------------------|--------------------|--------------------|
| author's benchmark | 6.5 | 5.1 | 5.6 |
| blstm_las + train with cold fusion + decode with lm fusion | 7.3 | 5.3 | 5.7 |
| blstm_las + train with cold fusion + decode without lm fusion | 7.0 | 5.2 | 5.6 |
| blstm_las + train without cold fusion + decode with lm fusion | 7.0 | 5.5 | 6.1 |
| blstm_las + train without cold fusion + decode without lm fusion | 7.2 | 5.4 | 5.9 |

The setup for this model is:

# ASR configuration
- conf=conf/asr/blstm_las.yaml
- conf2=conf/data/spec_augment_pt.yaml
- asr_init=
- external_lm=

You can see that this time I left external_lm and asr_init empty, which means I removed both cold fusion and model initialization.

Based on the test results, training without cold fusion doesn't bring any improvement. Is this because I didn't use ASR model initialization? Also, did I use the correct config file for SpecAugment?

Thanks.

hirofumi0810 commented 3 years ago

@lijianhackthon I think you should use --conf2=conf/data/spec_augment.yaml. How many epochs did you run? You need 50 epochs.

lijianhackthon commented 3 years ago

@hirofumi0810

Thanks for your quick reply. I used the default config conf/asr/blstm_las.yaml, in which I believe the item n_epochs: 25 sets the number of epochs.

I will use --conf2=conf/data/spec_augment.yaml and n_epochs: 50 to train a new model and let you know the results. Thanks.
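For the record, the change amounts to roughly this (a sketch; I edit the YAML in place and set the run.sh variables as above):

```bash
# Raise the number of epochs in the ASR config (the default in conf/asr/blstm_las.yaml is 25):
sed -i 's/^n_epochs: 25/n_epochs: 50/' conf/asr/blstm_las.yaml

# Relevant run.sh variables for the new run (no cold fusion, no init model):
conf=conf/asr/blstm_las.yaml
conf2=conf/data/spec_augment.yaml
asr_init=
external_lm=
```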

lijianhackthon commented 3 years ago

@hirofumi0810

I just got the test results of the LAS model trained with the setup you recommended: I used --conf2=conf/data/spec_augment.yaml and set n_epochs: to 50 in conf/asr/blstm_las.yaml. This time, I didn't use cold fusion or asr_init during training. Other than that, I left all the configs at their CSJ defaults. The results are in the table below:

| Setup                                                           | %WER for csj_eval1 | %WER for csj_eval2 | %WER for csj_eval3 |
|-----------------------------------------------------------------|--------------------|--------------------|--------------------|
| author's benchmark                                              | 6.5                | 5.1                | 5.6                |
| blstm_las + spec_augment + 50 epochs + decode without lm fusion | 7.1                | 5.4                | 5.7                |
| blstm_las + spec_augment + 50 epochs + decode with lm fusion    | 6.9                | 5.5                | 6.0                |

Still, there is an obvious gap between my results and the benchmark. Is there something wrong with my setup? Do I need to use speed_perturb to reach your benchmark?

It takes several days for me to try each new setup, so could you please give me a more complete setup so that I can replicate your benchmark? I believe this would also be helpful to other users who are new to this toolkit.

Thanks.

hirofumi0810 commented 3 years ago

@lijianhackthon I think you have already reproduced the results successfully. The best number was obtained by using character-CTC as an auxiliary task. Speed perturbation can boost the performance further (but takes more time).