huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Training without real labels #58

Open rolisz opened 6 months ago

rolisz commented 6 months ago

Is it possible to do the pseudo-labelling without access to already-transcribed audio?

From what I see in the training scripts, the dataset must have a text column, so it doesn't seem possible to distil a Whisper model from a collection of untranscribed audio alone.

sanchit-gandhi commented 6 months ago

Hey @rolisz! It would indeed be possible to use audio-only samples during pseudo-labelling. You're right that the pseudo-labelling script currently assumes we're pseudo-labelling a dataset of (audio, text) pairs, but there's no reason why we couldn't generalise this to audio-only examples. This should be pretty simple: you can just rip out all the references to "text_column_name" and "labels" in the pseudo-labelling script.
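As a rough illustration of the idea (not the repo's training script itself), one workaround is to generate the text column yourself with an off-the-shelf Whisper checkpoint and then pass the resulting (audio, text) pairs to the existing scripts unchanged. Below is a minimal sketch using 🤗 Transformers and Datasets; the dataset name and column names are placeholders, and the teacher checkpoint is just one possible choice:

```python
from datasets import load_dataset, Audio
from transformers import pipeline

# Hypothetical audio-only dataset with a single "audio" column.
dataset = load_dataset("my-org/my-audio-only-dataset", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Teacher model used to produce the pseudo-labels.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device=0,  # set to -1 for CPU
)

def add_pseudo_label(example):
    # Transcribe the raw waveform; the predicted text becomes the label.
    result = asr(
        {
            "raw": example["audio"]["array"],
            "sampling_rate": example["audio"]["sampling_rate"],
        },
        chunk_length_s=30,
    )
    example["text"] = result["text"]
    return example

# After this map, the dataset has (audio, text) pairs and can be fed to the
# existing distillation pipeline without modifying the scripts.
dataset = dataset.map(add_pseudo_label)
```

This is just a sketch of the audio-only case; the alternative suggested above (stripping "text_column_name" and "labels" from the pseudo-labelling script) keeps everything inside the repo's own tooling.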