huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Can distil-whisper load a local speech dataset? #50

Open shuaijiang opened 6 months ago

shuaijiang commented 6 months ago

distil-whisper can load datasets such as common_voice that are hosted on the Hugging Face Hub, but loading a private, local speech dataset is not supported.

I implemented a method to load a local speech dataset from a JSON file. It works, though it isn't perfect: https://github.com/shuaijiang/distil-whisper/blob/main/training/run_distillation_local_datasets.py
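For reference, here is a minimal sketch of that approach using 🤗 Datasets' built-in JSON loader; the manifest field names (`audio_filepath`, `text`) are illustrative assumptions, not necessarily the schema used in the linked script:

```python
from datasets import load_dataset, Audio

# Assumed JSON-lines manifest, one example per line, e.g.:
#   {"audio_filepath": "/data/wav/utt1.wav", "text": "..."}
dataset = load_dataset("json", data_files={"train": "train.jsonl"})

# Decode the path column as audio at the 16 kHz rate Whisper expects
dataset = dataset.cast_column("audio_filepath", Audio(sampling_rate=16_000))
dataset = dataset.rename_column("audio_filepath", "audio")
```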

sanchit-gandhi commented 6 months ago

Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset

So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!
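To make that concrete, here is a minimal sketch following the `audiofolder` convention from that guide; the directory layout and column names are assumptions about how you might organize your own data:

```python
from datasets import load_dataset, Audio

# Assumed layout (see the audio_dataset guide):
#   my_dataset/
#     metadata.csv   <- columns: file_name, transcription
#     utt1.wav
#     utt2.wav
dataset = load_dataset("audiofolder", data_dir="my_dataset")

# Resample on the fly to the 16 kHz Whisper expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Optionally push it to the Hub as a private dataset for the training scripts
# dataset.push_to_hub("your-username/my-speech-dataset", private=True)
```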

wntg commented 6 months ago

Hey @sanchit-gandhi! If I want to train on another language such as Chinese, how much data do I need to prepare?

sanchit-gandhi commented 6 months ago

Hey @wntg - there's some detailed information about the amount of data you need for each training method at the end of this README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

shuaijiang commented 6 months ago

> Hey @sanchit-gandhi! If I want to train on another language such as Chinese, how much data do I need to prepare?

In my experiments, 1,000-2,000 hours of high-quality Chinese speech data improve results a lot, perhaps reducing CER from 20 to 10. 10,000 hours of speech data also helps, perhaps from 10 to 5. Additionally, fine-tuning all parameters seems to work better than LoRA; you can refer to https://github.com/shuaijiang/Whisper-Finetune/blob/master/finetune_all.py
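For contrast between the two setups, here is a minimal sketch using 🤗 Transformers and PEFT; the LoRA hyperparameters below are illustrative assumptions, not the values from the linked script:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Full fine-tuning: every parameter stays trainable (the default)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"full fine-tuning: {trainable / 1e6:.0f}M trainable parameters")

# LoRA alternative: freeze the base model and train small adapter matrices
lora_config = LoraConfig(
    r=32,                                 # adapter rank (assumed value)
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
    lora_dropout=0.05,
)
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()
```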

shuaijiang commented 6 months ago

> Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset
>
> So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!

Thanks, I will try it.

xingchensong commented 6 months ago

wenet enables full-parameter fine-tuning of the whisper-large model in approximately 10 hours on the aishell-1 dataset, with 40 epochs on 8 RTX 3090 GPUs.

For more information, refer to the aishell-1 recipe available at https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/whisper

I believe that using wenet will simplify the creation of local speech datasets.

Furthermore, it is significantly easier to make Whisper streaming-capable by fine-tuning it under wenet's U2++ framework. Simply treat Whisper as a large transformer model and leverage all of wenet's existing functionality (such as chunk masking and the hybrid CTC/AED loss). Please see https://github.com/wenet-e2e/wenet/pull/2141 for more details.
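To illustrate the chunk-mask idea, here is a generic PyTorch sketch of chunk-based attention masking; it shows the concept only and is not wenet's actual implementation:

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask where True means 'may attend'.

    Each query position can attend to every position up to the end of its
    own chunk, so the encoder only ever sees a bounded amount of future
    context, which is what makes chunk-wise streaming possible.
    """
    idx = torch.arange(seq_len)
    chunk_end = (idx // chunk_size + 1) * chunk_size  # exclusive end of each chunk
    return idx.unsqueeze(0) < chunk_end.unsqueeze(1)

# Positions 0-3 attend within chunk 0 only; positions 4-7 see chunks 0 and 1
mask = chunk_attention_mask(seq_len=8, chunk_size=4)
```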

sanchit-gandhi commented 5 months ago

Definitely more data will help here! I left some recommendations in the README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

Really cool to see that you've been working on Chinese - excited to see the model you train 🚀 Let me know how you get on @shuaijiang!