modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Problem adding new words when finetuning Paraformer? #1064

Open zw76859420 opened 8 months ago

zw76859420 commented 8 months ago

Hi all, a question: we added several OOV characters to the dictionary and ran training. The steps were as follows:

1) Added the characters to the dictionary (tokens.txt):

    眀
    瑧

2) Trained with the original Paraformer recipe (egs/aishell/s1/run.sh), modified as follows:

    train.py \
        --task_name asr \
        --gpu_id $gpu_id \
        --use_preprocessor true \
        --token_type $token_type \
        --token_list $token_list \
        --dataset_type large \
        --data_dir ${feats_dir}/data \
        --train_set ${train_set} \
        --valid_set ${valid_set} \
        --data_file_names "wav.scp,text" \
        --cmvn_file model/am.mvn \
        --speed_perturb ${speed_perturb} \
        --resume true \
        --init_param model/model.pb \
        --ignore_init_mismatch true ...

The resulting training log looks like this:

    2023-11-07 12:51:26,261 (build_trainer:733) INFO: 3epoch:train:1-50batch:124num_updates: iter_time=0.732, forward_time=6.521, loss_att=7.832, acc=2.894e-05, loss_pre=0.009, loss=7.841, backward_time=0.451, optim_step_time=0.052, optim0_lr0=1.992e-06, train_time=32.939

As the log shows, the accuracy stays near zero (acc=2.894e-05) and loss_att stays high (7.832). What is the cause? Looking forward to your replies.

tramphero commented 8 months ago

The current FunASR finetuning pipeline does not support adding OOV tokens by directly modifying the subword vocabulary. If you need this, the following modifications are required:

1) Modify tokens.txt. It is recommended to extend the existing list with the new tokens.

2) Modify model.pb. After adding modeling units, the output layer of the model needs to be expanded accordingly, and the connections of the newly added modeling units are initialized randomly.

3) Finetune the model. Since OOV tokens have been added, the training data needs to cover them sufficiently so that they can be recognized after training.
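
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not FunASR code: the helper names (`append_tokens`, `expand_rows`, `token_counts`) are hypothetical, NumPy arrays stand in for the checkpoint parameters, and the sketch assumes new tokens can be appended at the end of the list. Whether that position is safe depends on how the recipe treats special symbols, so check it against your own tokens.txt and config.

```python
import numpy as np


def append_tokens(token_list, new_tokens):
    """Step 1: return a copy of token_list with unseen new_tokens appended.

    Assumption: appending at the end keeps all existing token ids unchanged.
    """
    seen = set(token_list)
    extended = list(token_list)
    for tok in new_tokens:
        if tok not in seen:
            extended.append(tok)
            seen.add(tok)
    return extended


def expand_rows(weight, new_vocab_size, rng, std=0.02):
    """Step 2: grow a parameter whose first axis is the vocabulary.

    Existing rows are copied unchanged; rows for the new modeling units
    are drawn from a normal distribution (random initialization).
    """
    old_vocab_size = weight.shape[0]
    if new_vocab_size < old_vocab_size:
        raise ValueError("new vocabulary must not be smaller than the old one")
    extra_shape = (new_vocab_size - old_vocab_size,) + weight.shape[1:]
    extra = rng.normal(0.0, std, size=extra_shape).astype(weight.dtype)
    return np.concatenate([weight, extra], axis=0)


def token_counts(transcripts, tokens):
    """Step 3: count how often each new token occurs in the training text."""
    return {tok: sum(line.count(tok) for line in transcripts) for tok in tokens}


# Toy data standing in for tokens.txt and the output-layer parameters.
rng = np.random.default_rng(0)
tokens = ["<blank>", "<s>", "</s>", "你", "好"]
new_tokens = ["眀", "瑧"]
extended = append_tokens(tokens, new_tokens)

old_w = rng.normal(size=(len(tokens), 4)).astype(np.float32)  # (vocab, dim)
old_b = np.zeros(len(tokens), dtype=np.float32)               # (vocab,)
new_w = expand_rows(old_w, len(extended), rng)
new_b = expand_rows(old_b, len(extended), rng, std=0.0)       # new biases = 0

coverage = token_counts(["眀天", "瑧瑧"], new_tokens)
```

In a real checkpoint, every parameter whose shape depends on the vocabulary size (for example a decoder embedding as well as the output projection) would need the same treatment; which parameters those are has to be read from the model itself. Tokens with very low counts in the coverage check are unlikely to be learned during finetuning.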

Hope it will be helpful!

zw76859420 commented 8 months ago

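Many thanks, 良博, for the prompt answer. Understood! Looking forward to FunASR getting better and better and building a complete ASR ecosystem.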