k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

fine-tune problem #1670

Closed PeaceAndJoyAaron closed 4 weeks ago

PeaceAndJoyAaron commented 3 months ago

When I tried fine-tuning, the WER after fine-tuning was over 100, so something must be wrong with my fine-tuning setup. Below is a simple fine-tuning log of mine; I'm not sure whether it reflects the problem. Looking forward to your reply. log-train-2024-06-26-05-32-23.txt

marcoyang1998 commented 3 months ago

Which model are you fine-tuning? Could you please show your fine-tune command?

PeaceAndJoyAaron commented 3 months ago

My fine-tuning command is:

```
./pruned_transducer_stateless7_streaming/finetune.py \
  --world-size 1 \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_giga_finetune \
  --use-fp16 1 \
  --base-lr 0.015 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model k2fsa-zipformer-chinese-english-mixed/data/lang_char_bpe/bpe.model \
  --do-finetune True \
  --use-mux False \
  --finetune-ckpt k2fsa-zipformer-chinese-english-mixed/exp/pretrained.pt \
  --max-duration 150 \
  --finetune-ckpt k2fsa-zipformer-chinese-english-mixed/exp/pretrained.pt
```

The model I want to fine-tune is: https://hf-mirror.com/csukuangfj/k2fsa-zipformer-chinese-english-mixed

marcoyang1998 commented 3 months ago

It seems that you are using a very small max_duration for fine-tuning. This will hurt the performance.

What data are you using for fine-tuning?

PeaceAndJoyAaron commented 3 months ago

I want to use approximately 500 WAV files to fine-tune the model.

PeaceAndJoyAaron commented 3 months ago

I referred to https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7/finetune.py. To run fine-tuning with this script, I only changed the jsonl.gz paths in train_cuts and the other methods of GigaSpeechAsrDataModule to point to the feature files generated from my own data. I am not sure whether this is correct.

danpovey commented 3 months ago

Your learning rate is way too high, I think (1.5e-02 or so). I'd try something at least 10 or 20 times lower.
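For example (an illustrative fragment based on the command posted above, not a command from this thread; only the learning rate changes, all other flags stay as they were):

```shell
# 0.015 / 10 = 0.0015, 0.015 / 20 = 0.00075
./pruned_transducer_stateless7_streaming/finetune.py \
  --base-lr 0.00075 \
  ...   # remaining flags unchanged from the original command
```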

PeaceAndJoyAaron commented 3 months ago

I adjusted my fine-tuning command; it is currently:

```
./pruned_transducer_stateless7_streaming/finetune.py \
  --world-size 1 \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_giga_finetune \
  --use-fp16 1 \
  --base-lr 0.001 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model k2fsa-zipformer-chinese-english-mixed/data/lang_char_bpe/bpe.model \
  --do-finetune True \
  --use-mux False \
  --finetune-ckpt k2fsa-zipformer-chinese-english-mixed/exp/pretrained.pt \
  --max-duration 150 \
  --finetune-ckpt k2fsa-zipformer-chinese-english-mixed/exp/pretrained.pt
```

But the result is the same.

PeaceAndJoyAaron commented 3 months ago

> It seems that you are using a very small max_duration for fine-tuning. This will hurt the performance.
>
> What data are you using for fine-tuning?

I have discovered an issue and I don't know why. When I set max-duration to 150, the script runs. However, if I set it to 500, the `for batch_idx, batch in enumerate(train_dl)` loop in the train_one_epoch method never executes. What is the reason for this? Looking forward to your reply.

marcoyang1998 commented 3 months ago

As you mentioned, your dataset is very small (only 500 wav files). It may be too small for a large max_duration.

pzelasko commented 3 months ago

If you set drop_last to False, it will return a single batch with all available data in your dataset per epoch.
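To illustrate why a tiny dataset can yield zero batches: duration-based samplers accumulate cuts until their total duration reaches max_duration, and with drop_last=True the final partial batch is discarded. If the whole dataset never fills a single batch, nothing is ever yielded. This is a minimal pure-Python sketch of that idea, not icefall's actual sampler (icefall uses Lhotse's DynamicBucketingSampler, which is more sophisticated); the function name and the 5-second utterance durations are illustrative assumptions.

```python
def duration_batches(durations, max_duration, drop_last=True):
    """Toy model of duration-based batching: group utterance durations
    (in seconds) into batches whose total does not exceed max_duration."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > max_duration:
            batches.append(current)  # flush the full batch
            current, total = [], 0.0
        current.append(d)
        total += d
    if current and not drop_last:
        batches.append(current)  # keep the final partial batch
    return batches

# 500 utterances of ~5 s each => ~2500 s of audio in total
data = [5.0] * 500

print(len(duration_batches(data, max_duration=150)))   # 16 full batches
# If max_duration exceeds the total dataset duration, no batch ever
# fills up, and drop_last=True discards everything:
print(len(duration_batches(data, max_duration=3000)))  # 0
print(len(duration_batches(data, max_duration=3000, drop_last=False)))  # 1
```

The same effect can appear well before max_duration exceeds the total duration, because the real sampler buckets cuts by length, and a small dataset may not fill any bucket.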

daocunyang commented 3 months ago

> As you mentioned, your dataset is too small (only 500 wav files). Your data may be too small for a large max_duration.

Just curious: to fine-tune a model like sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20, how much data (in terms of number of files, or total duration) is needed? We are trying to fine-tune it with 8 kHz data in order to improve its accuracy in phone-conversation scenarios.

marcoyang1998 commented 3 months ago

@daocunyang Ideally, as much data as possible. I would say a minimum of 50 hours. Remember to resample your 8 kHz data to 16 kHz when extracting the features.
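One common way to do the resampling offline (illustrative commands, not from this thread; the file names are hypothetical, and this assumes sox or ffmpeg is installed):

```shell
# Upsample a single 8 kHz recording to 16 kHz with sox:
sox input_8k.wav -r 16000 output_16k.wav

# Or the equivalent with ffmpeg:
ffmpeg -i input_8k.wav -ar 16000 output_16k.wav
```

Note that upsampling cannot restore frequency content above 4 kHz that narrowband telephone audio never had; it only makes the sample rate match what the model expects.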

daocunyang commented 3 months ago

@marcoyang1998 Thanks. A couple of questions: 1) For the resampling part, does it mean that once we are done with fine-tuning and we use the model for real-time ASR in the phone scenario, we also need to resample all incoming 8 kHz audio to 16 kHz?

2) Also, I wonder: do you have any recommendation for an open-source online ASR model trained specifically on 8 kHz data? If there is one, it might be better for us to fine-tune a model trained on 8 kHz data, instead of one trained only on 16 kHz data.

marcoyang1998 commented 3 months ago

> For the resampling part, does it mean that once we are done with fine-tuning, and when we use it for real-time ASR for phone scenario, we also need to resample all data (8k) to 16k?

Yes.

> Also I wonder, do you have any recommendation on open-source online ASR model that's trained specifically with 8k data?

I'm not familiar with this. As far as I know, most mainstream models are trained with 16 kHz audio.