huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

[Question] Mixed Speech transcription #54

Open · Tejaswgupta opened this issue 6 months ago

Tejaswgupta commented 6 months ago

Is it possible to fine-tune Whisper/Distil-Whisper to achieve mixed-speech transcription, such as Hindi+English within a single sentence, which is common in casual conversations? Has anyone tried this before? Would training on a mixture of Hindi and English datasets work? I recently used a fine-tuned Whisper model for ASR and it ended up hallucinating and adding extra text, which I haven't been able to fix yet.

sanchit-gandhi commented 6 months ago

Just to get a better idea: does your dataset include transcriptions in Hinglish? If so, you can try fine-tuning/distilling directly on these labels, and the model should learn the semantics of Hinglish directly from your training data.

When doing so, you can set the language in the tokenizer to “hindi”. I think this is the best option for getting language transfer from Hindi -> Hinglish (set the argument --language="hi").
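For reference, a minimal sketch of what this looks like with the transformers WhisperProcessor; the checkpoint name and the example transcript are purely illustrative, not part of the suggestion above:

```python
from transformers import WhisperProcessor

# Load the processor with Hindi as the target language and transcription as the task.
# "distil-whisper/distil-large-v2" is only an example checkpoint.
processor = WhisperProcessor.from_pretrained(
    "distil-whisper/distil-large-v2", language="hindi", task="transcribe"
)

# Tokenize a Hinglish reference transcript; the Hindi language token is prepended
# automatically, which is what encourages Hindi -> Hinglish transfer.
labels = processor.tokenizer("yeh meeting kal morning mein reschedule kar do").input_ids
```

When launching a training script that exposes a language flag, the equivalent is passing --language="hi".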

A-Raafat commented 5 months ago

Could you please explain how to fine-tune the distilled model directly, instead of training the full model and then running the distillation process?

sanchit-gandhi commented 5 months ago

Here's an overview of the training methods, with a link to direct fine-tuning: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

Does that answer your question @A-Raafat?

A-Raafat commented 5 months ago

@sanchit-gandhi Yes. That answers it. Thank you

lq0104 commented 1 month ago

@sanchit-gandhi

Here's an overview of the training methods, with a link to direct fine-tuning: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

Hello, I have a question. The link above points to the fine-tuning guide at https://huggingface.co/blog/fine-tune-whisper. However, it seems that the fine-tuning code does not support timestamp training or conditioning on previous labels, both of which are available in distillation (see timestamp_probability and condition_on_prev_probability in run_distillation.py). I believe these two features are crucial for ASR tasks and have a significant impact on performance. So I would like to ask: if I only want to do fine-tuning, can I use similar options to timestamp_probability and condition_on_prev_probability? Thank you.
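Not an official answer, but for illustration, here is a minimal sketch of how the two augmentations could be reproduced in your own fine-tuning preprocessing. It assumes the transformers WhisperTokenizer, reference transcripts that may already contain Whisper-style <|x.xx|> timestamp tokens, and a prev_text field holding the preceding segment. The function prepare_labels, both probability constants, and the checkpoint name are made up for this sketch; this is not code from the fine-tuning blog or from run_distillation.py.

```python
import random
import re
from typing import List, Optional

from transformers import WhisperTokenizer

# Illustrative probabilities, mirroring the spirit of timestamp_probability and
# condition_on_prev_probability in run_distillation.py (values here are arbitrary).
TIMESTAMP_PROBABILITY = 0.2
CONDITION_ON_PREV_PROBABILITY = 0.2

tokenizer = WhisperTokenizer.from_pretrained(
    "distil-whisper/distil-large-v2", language="hindi", task="transcribe"
)

def prepare_labels(text: str, prev_text: Optional[str] = None) -> List[int]:
    """Tokenize one transcript, randomly keeping timestamps and/or prepending context."""
    # Keep timestamp tokens for a fraction of samples; otherwise strip them from the
    # text and let the <|notimestamps|> prefix token be used instead.
    keep_timestamps = random.random() < TIMESTAMP_PROBABILITY
    if not keep_timestamps:
        text = re.sub(r"<\|\d+\.\d+\|>", "", text).strip()
    tokenizer.set_prefix_tokens(predict_timestamps=keep_timestamps)
    label_ids = tokenizer(text).input_ids

    # For a fraction of samples, prepend the previous segment after <|startofprev|>
    # (one of Whisper's special tokens) so the model learns to condition on context.
    if prev_text is not None and random.random() < CONDITION_ON_PREV_PROBABILITY:
        prev_ids = [tokenizer.convert_tokens_to_ids("<|startofprev|>")]
        prev_ids += tokenizer(prev_text, add_special_tokens=False).input_ids
        label_ids = prev_ids + label_ids

    return label_ids
```

Whether this actually helps will depend on your data: the timestamp branch only matters if your references contain timestamp tokens, and the previous-context branch requires segments that come from contiguous audio.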