[Closed] PeaceAndJoyAaron closed this issue 4 weeks ago.
Which model are you fine-tuning? Could you please show your fine-tune command?
Here is my fine-tuning command:

```
./pruned_transducer_stateless7_streaming/finetune.py \
  --world-size 1 \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_giga_finetune \
  --use-fp16 1 \
  --base-lr 0.015 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model k2fsa-zipformer-chinese-english-mixed/data/lang_char_bpe/bpe.model \
  --do-finetune True \
  --use-mux False \
  --finetune-ckpt k2fsa-zipformer-chinese-english-mixed/exp/pretrained.pt \
  --max-duration 150
```

This is the model I want to fine-tune: https://hf-mirror.com/csukuangfj/k2fsa-zipformer-chinese-english-mixed
It seems that you are using a very small max_duration for fine-tuning. This will hurt the performance.

What data are you using for fine-tuning?
I want to use approximately 500 WAV files to fine tune the model
I referred to https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7/finetune.py. For the fine-tuning run, the only change I made was to point the jsonl.gz paths in train_cuts (and the other methods of GigaSpeechAsrDataModule) at the feature files generated from my own data. I am not sure whether this is correct.
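For reference, a cuts manifest like the one train_cuts returns is just gzipped JSON Lines (in icefall it is normally loaded with lhotse's lazy manifest loader, not by hand). This stdlib-only sketch writes and reads a toy two-cut manifest, with a hypothetical filename, purely to show the shape of the file you would point the data module at:

```python
import gzip
import json
from pathlib import Path

# Hypothetical manifest path -- this stands in for the jsonl.gz you would
# point GigaSpeechAsrDataModule.train_cuts at. In a real run, lhotse loads
# this lazily; here we read it by hand just to illustrate the format.
manifest = Path("my_cuts_train.jsonl.gz")

# Write a toy two-cut manifest so the example is self-contained.
cuts = [
    {"id": "wav-0001", "start": 0.0, "duration": 3.2},
    {"id": "wav-0002", "start": 0.0, "duration": 2.8},
]
with gzip.open(manifest, "wt", encoding="utf-8") as f:
    for cut in cuts:
        f.write(json.dumps(cut) + "\n")

# Read it back the way a lazy reader would: one JSON object per line.
with gzip.open(manifest, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

total_seconds = sum(c["duration"] for c in loaded)
print(len(loaded), round(total_seconds, 1))  # 2 6.0
```

Swapping the manifest path in train_cuts is the usual way to fine-tune on custom data, as long as the features were extracted with the same configuration the model expects.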
Your learning rate is way too high, I think: 1.5e-02 or so. I'd try at least 10 or 20 times lower.
I adjusted my fine-tuning command; currently it is:

```
./pruned_transducer_stateless7_streaming/finetune.py \
  --world-size 1 \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_giga_finetune \
  --use-fp16 1 \
  --base-lr 0.001 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model k2fsa-zipformer-chinese-english-mixed/data/lang_char_bpe/bpe.model \
  --do-finetune True \
  --use-mux False \
  --finetune-ckpt k2fsa-zipformer-chinese-english-mixed/exp/pretrained.pt \
  --max-duration 150
```

But the result is the same.
> It seems that you are using a very small max_duration for fine-tuning. This will hurt the performance. What data are you using for fine-tuning?
I have discovered an issue and I don't know why it happens. When I set max_duration to 150, the script runs. However, if I set it to 500, the `for batch_idx, batch in enumerate(train_dl)` loop in the train_one_epoch method never executes. What is the reason for this? Looking forward to your reply.
As you mentioned, your dataset is small (only 500 wav files). That may be too little data for a large max_duration. If you set drop_last to False, the sampler will return a single batch containing all available data in your dataset per epoch.
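To make the drop_last behaviour concrete, here is a toy simulation (this is NOT lhotse's actual DynamicBucketingSampler, just a minimal sketch of the same packing idea) assuming the 500 wav files are short, say about 0.9 s each, so the whole dataset totals less than 500 s:

```python
# Toy duration-based sampler: cuts are packed into a batch until adding one
# more would exceed max_duration; with drop_last=True, a final not-yet-full
# batch is discarded.
def make_batches(durations, max_duration, drop_last):
    batches, current, current_dur = [], [], 0.0
    for d in durations:
        if current and current_dur + d > max_duration:
            batches.append(current)
            current, current_dur = [], 0.0
        current.append(d)
        current_dur += d
    if current and not drop_last:
        batches.append(current)
    return batches

# 500 files of ~0.9 s each: only ~450 s of audio in total.
durations = [0.9] * 500

# max_duration=150: several full batches, so the training loop runs.
print(len(make_batches(durations, 150, drop_last=True)))   # 3

# max_duration=500: the whole dataset fits in one *incomplete* batch,
# which drop_last=True throws away -> zero batches, loop never executes.
print(len(make_batches(durations, 500, drop_last=True)))   # 0
print(len(make_batches(durations, 500, drop_last=False)))  # 1
```

This matches the symptom above: with max_duration=500 and drop_last=True, the dataloader yields nothing, so train_one_epoch's loop body never runs.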
As you mentioned, your dataset is too small (only 500 wav files). Your data may be too small for a large max_duration.
Just curious: in order to fine-tune a model like sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20, how much data (in number of files, or total duration) is needed? We are trying to fine-tune it with 8 kHz data in order to improve its accuracy in phone-conversation scenarios.
@daocunyang Ideally, as much data as possible. I would say a minimum of 50 hours. Remember to resample your 8 kHz data to 16 kHz when extracting the features.
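In a real pipeline you would resample with lhotse (e.g. `CutSet.resample(16000)`), sox, or torchaudio before feature extraction. Just to illustrate what 8 kHz -> 16 kHz resampling does to the sample stream, here is a stdlib-only sketch using simple linear interpolation (real resamplers use proper low-pass filtering, so do not use this in production):

```python
# Toy 2x upsampling by linear interpolation: each output pair is the
# original sample followed by the midpoint to its right neighbour.
def upsample_2x(samples):
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) / 2.0)  # midpoint between neighbours
    return out

audio_8k = [0.0, 1.0, 0.0, -1.0]   # 4 samples at 8 kHz = 0.5 ms of audio
audio_16k = upsample_2x(audio_8k)
print(len(audio_16k))  # 8: same 0.5 ms of audio, now at 16 kHz
print(audio_16k)       # [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.0]
```

The point is that resampling changes the sample count, not the duration: the model sees 16 000 samples per second either way, which is what its feature extractor was trained on.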
@marcoyang1998 Thanks. A couple of questions: 1) For the resampling part, does that mean that once we are done with fine-tuning and use the model for real-time ASR in the phone scenario, we also need to resample all incoming 8 kHz data to 16 kHz?
2) Also, do you have any recommendation for an open-source online ASR model trained specifically on 8 kHz data? If there is one, maybe it's better for us to fine-tune a model trained on 8 kHz data instead of one trained only on 16 kHz data.
> For the resampling part, does that mean that once we are done with fine-tuning and use the model for real-time ASR in the phone scenario, we also need to resample all incoming 8 kHz data to 16 kHz?
Yes.
> Also, do you have any recommendation for an open-source online ASR model trained specifically on 8 kHz data? If there is one, maybe it's better for us to fine-tune a model trained on 8 kHz data instead of one trained only on 16 kHz data.
I'm not familiar with this. As far as I know, most mainstream models are trained with 16kHz audio.
When I tried fine-tuning, I found that the WER after fine-tuning was over 100%, which suggests something is wrong with the fine-tuning. Below is a simple fine-tuning log of mine; I'm not sure whether it reflects the problem. Looking forward to your reply. log-train-2024-06-26-05-32-23.txt