shuaijiang opened this issue 12 months ago
Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset
So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!
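As a concrete illustration of that guide, here is a minimal sketch that builds a dataset from local audio files and transcripts. The column names (`audio`, `sentence`) and the file paths are placeholders for illustration, not a required schema:

```python
# Minimal sketch: turn local audio files + transcripts into a Hugging Face
# Dataset. Column names and paths below are placeholders, not a fixed schema.
from datasets import Dataset, DatasetDict, Audio

data = {
    "audio": ["clips/sample_0001.wav", "clips/sample_0002.wav"],  # local paths
    "sentence": ["第一条转写文本", "第二条转写文本"],                 # transcripts
}

dataset = Dataset.from_dict(data)
# Decode the audio paths into arrays at Whisper's expected 16 kHz sampling rate.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Optionally wrap it in a DatasetDict and save locally (or push to the Hub)
# so the training scripts can load it like any other dataset.
dataset_dict = DatasetDict({"train": dataset})
dataset_dict.save_to_disk("my_chinese_asr_dataset")
# dataset_dict.push_to_hub("username/my_chinese_asr_dataset", private=True)
```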
Hey @sanchit-gandhi! If I want to train on another language such as Chinese, how much data do I need to prepare?
Hey @wntg - there's some detailed information about the amount of data you need for each training method at the end of this README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods
> Hey @sanchit-gandhi! If I want to train on another language such as Chinese, how much data do I need to prepare?
In my experience, 1,000-2,000 hours of high-quality Chinese speech data improves results a lot, roughly bringing CER from 20 down to 10. 10,000 hours of speech data helps further, perhaps from 10 down to 5. Additionally, fine-tuning all parameters seems to work better than LoRA; you can refer to https://github.com/shuaijiang/Whisper-Finetune/blob/master/finetune_all.py
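For anyone wanting to reproduce the full-parameter approach with 🤗 Transformers rather than the linked script, a minimal sketch could look like the following. The checkpoint, dataset path, and hyperparameters are illustrative assumptions, not the settings used in finetune_all.py:

```python
# Minimal sketch of full-parameter Whisper fine-tuning with 🤗 Transformers
# (every weight is trainable, no LoRA adapters). The checkpoint, dataset path
# and hyperparameters are illustrative, not the settings from finetune_all.py.
from datasets import load_from_disk
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v2", language="zh", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

dataset = load_from_disk("my_chinese_asr_dataset")  # e.g. the dataset built above

def prepare(batch):
    # Compute log-mel features and tokenize the transcript.
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare, remove_columns=dataset["train"].column_names)

def collate(features):
    # Pad features and label ids separately, then mask the label padding
    # with -100 so it is ignored by the cross-entropy loss.
    inputs = [{"input_features": f["input_features"]} for f in features]
    labels = [{"input_ids": f["labels"]} for f in features]
    batch = processor.feature_extractor.pad(inputs, return_tensors="pt")
    label_batch = processor.tokenizer.pad(labels, return_tensors="pt")
    label_ids = label_batch["input_ids"].masked_fill(
        label_batch["attention_mask"].ne(1), -100)
    # Drop the leading decoder-start token if the tokenizer already added it;
    # the model prepends it again when shifting the labels right.
    if (label_ids[:, 0] == model.config.decoder_start_token_id).all().item():
        label_ids = label_ids[:, 1:]
    batch["labels"] = label_ids
    return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v2-zh-full-ft",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,          # a small LR is typical when updating all weights
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    data_collator=collate,
)
trainer.train()
```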
> Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset
> So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!
Thanks, I will try it.
wenet enables (full-parameter) fine-tuning of the whisper-large model in approximately 10 hours on the aishell-1 dataset, with 40 epochs on 8 × 3090 GPUs.
For more information, refer to the aishell-1 recipe available at https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/whisper
I believe that using wenet will simplify the creation of local speech datasets.
Furthermore, it is significantly easier to make Whisper streaming by fine-tuning it under wenet's U2++ framework. Simply treat Whisper as a large transformer model and leverage all of wenet's existing functionality (such as chunk masking, the hybrid CTC-AED loss, and so on). Please see https://github.com/wenet-e2e/wenet/pull/2141 for more details.
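For intuition, here is a minimal PyTorch sketch of the hybrid CTC-AED loss idea (not wenet's actual implementation; the tensor shapes and the 0.3 CTC weight are illustrative assumptions):

```python
# Minimal sketch of a hybrid CTC-AED loss in the spirit of U2++-style training
# (not wenet's actual implementation). Shapes and the 0.3 weight are illustrative.
import torch.nn.functional as F

def hybrid_ctc_aed_loss(encoder_logits,   # (T, N, V) frame-level logits from the encoder/CTC head
                        encoder_lengths,  # (N,) valid frame counts per utterance
                        decoder_logits,   # (N, L, V) token-level logits from the attention decoder
                        targets,          # (N, L) padded target token ids (padding = pad_id)
                        target_lengths,   # (N,) valid token counts per utterance
                        ctc_weight=0.3,
                        blank_id=0,
                        pad_id=-100):
    # CTC branch: alignment-free loss over encoder frames.
    log_probs = F.log_softmax(encoder_logits, dim=-1)
    ctc_targets = targets.clamp(min=0)  # padding positions are skipped via target_lengths
    ctc = F.ctc_loss(log_probs, ctc_targets, encoder_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)
    # AED branch: token-level cross-entropy from the attention decoder.
    aed = F.cross_entropy(decoder_logits.transpose(1, 2), targets,
                          ignore_index=pad_id)
    # Interpolate the two branches, as in hybrid CTC/attention training.
    return ctc_weight * ctc + (1.0 - ctc_weight) * aed
```

In wenet the CTC branch also drives streaming first-pass decoding and attention rescoring; the snippet above only shows the training-loss interpolation.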
Definitely more data will help here! I left some recommendations in the README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods
Really cool to see that you've been working on Chinese - excited to see the model you train 🚀 Let me know how you get on @shuaijiang!
Distil-Whisper loads datasets such as Common Voice directly from the Hugging Face Hub, but loading a private, local speech dataset is not supported out of the box.
I implemented a method to load a local speech dataset from a JSON file. It works, though it is not perfect: https://github.com/shuaijiang/distil-whisper/blob/main/training/run_distillation_local_datasets.py
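For reference, a plain JSON/JSON-lines manifest can often be loaded directly with 🤗 Datasets as well. The field names and file paths below are assumptions about the manifest layout, not the format that run_distillation_local_datasets.py expects:

```python
# Minimal sketch: load a local JSON-lines manifest as a Hugging Face Dataset.
# Assumes each line looks like {"audio": "clips/0001.wav", "sentence": "..."};
# this layout is an assumption, not the schema used by
# run_distillation_local_datasets.py.
from datasets import load_dataset, Audio

dataset = load_dataset("json", data_files={"train": "train.jsonl",
                                           "eval": "dev.jsonl"})
# Cast the audio path column so it is decoded as 16 kHz audio for Whisper.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

print(dataset["train"][0]["audio"]["array"].shape)
```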