modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Question about training on Chinese 8 kHz data #873

Closed kelvinqin closed 7 months ago

kelvinqin commented 1 year ago

Hi, I just followed the aishell/paraformer configuration to try to build an 8 kHz model. These are the results at the 68th epoch:

68epoch results:
[train] iter_time=5.083e-04, forward_time=0.524, loss_ctc=8.589, loss_att=5.255, pre_loss_att=6.479, acc=0.855, loss_pre=0.535, loss=11.326, backward_time=0.260, optim_step_time=0.065, optim0_lr0=1.517e-04, train_time=0.986, time=1 hour, 19 minutes and 17.85 seconds, total_count=328168, gpu_max_cached_mem_GB=39.607
[valid] loss_ctc=6.268, cer_ctc=0.128, loss_att=3.646, pre_loss_att=5.071, acc=0.909, cer=0.082, wer=0.429, loss_pre=0.414, loss=8.397, time=9 seconds, total_count=1088, gpu_max_cached_mem_GB=39.607

Strangely, when I test the model from the 68th epoch on the dev set using asr_inference_launch, I get a very bad result:

%WER 86.36 [ 13158 / 15236, 94 ins, 2903 del, 10161 sub ] %SER 99.22 [ 1017 / 1025 ] Scored 1025 sentences, 0 not present in hyp.
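For reference, the %WER and %SER figures in the scoring line above follow the standard definitions; this is just arithmetic on the reported counts, not FunASR-specific code:

```python
# Reproduce the scoring summary from the counts in the report above.
ins, dele, sub = 94, 2903, 10161   # insertions, deletions, substitutions
ref_tokens = 15236                 # total reference tokens scored
errors = ins + dele + sub          # 13158 total errors
wer = 100.0 * errors / ref_tokens  # error rate in percent

err_sents, total_sents = 1017, 1025
ser = 100.0 * err_sents / total_sents  # sentence error rate in percent

print(f"%WER {wer:.2f} [ {errors} / {ref_tokens} ]")   # %WER 86.36 [ 13158 / 15236 ]
print(f"%SER {ser:.2f} [ {err_sents} / {total_sents} ]")  # %SER 99.22 [ 1017 / 1025 ]
```

The very high deletion count (2903) relative to insertions is itself a hint that decoding is producing far too little output, which often points to a front-end mismatch rather than a poorly trained model.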

According to the training log, the validation accuracy is 0.909, so why do I get such a bad WER when I run asr_inference_launch on the same dataset? Are there any hints to help me fix this?

My training data is 8 kHz, roughly 300 hours of Mandarin telephony conversational data.

My only change to aishell/paraformer/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml is on line 35: I changed `fs: 16000` to `fs: 8000`. Thanks!
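One common source of a train/infer gap like this is the inference pipeline reading or resampling audio at a different rate than the `fs` the model was trained with. As a toolkit-agnostic sanity check (this is not FunASR API, just Python's stdlib `wave` module), you can verify that the wav files fed to decoding really are 8 kHz:

```python
import math
import struct
import wave

EXPECTED_FS = 8000  # must match "fs: 8000" in the training yaml


def check_sample_rate(path: str, expected_fs: int = EXPECTED_FS) -> int:
    """Return the file's sample rate, raising if it differs from expected_fs."""
    with wave.open(path, "rb") as wf:
        fs = wf.getframerate()
    if fs != expected_fs:
        raise ValueError(f"{path}: sample rate {fs} != expected {expected_fs}")
    return fs


# Demo: write a one-second 8 kHz sine tone, then verify it.
with wave.open("tone_8k.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit PCM
    wf.setframerate(EXPECTED_FS)
    samples = (int(1000 * math.sin(2 * math.pi * 440 * t / EXPECTED_FS))
               for t in range(EXPECTED_FS))
    wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(check_sample_rate("tone_8k.wav"))  # prints 8000
```

It is also worth confirming that the inference configuration (not only the training yaml) carries the same `fs: 8000` setting, since a decoder front end still configured for 16 kHz would explain a model that validates well but scores badly at test time.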

hnluo commented 1 year ago

Please provide some bad cases so we can find the problem.