huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

VoxPopuli text column "raw_text" shows an empty string on the HF dataset card. #98

Closed by guynich 3 months ago

guynich commented 3 months ago

When pseudo-labelling the VoxPopuli dataset, the "raw_text" column (the value passed to the --text_column_name option) may be an empty string for some examples. See the HF dataset card here for an example with an empty "raw_text".
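To get a feel for how many examples are affected before pseudo-labelling, something like the following could be used. This is only a minimal sketch: `find_empty_text` is a hypothetical helper, not part of distil-whisper, and the sample rows are made up to mimic the "raw_text" and "normalized_text" columns.

```python
from typing import Any, Dict, Iterable, List


def find_empty_text(
    examples: Iterable[Dict[str, Any]], column: str = "raw_text"
) -> List[int]:
    """Return the positions of examples whose `column` is missing, None, or blank."""
    empty = []
    for i, example in enumerate(examples):
        value = example.get(column)
        # Treat a missing key, None, "" and whitespace-only strings as empty.
        if value is None or not str(value).strip():
            empty.append(i)
    return empty


# Hypothetical rows mimicking the two VoxPopuli text columns (not real data).
sample = [
    {"raw_text": "Thank you, Mr President.", "normalized_text": "thank you mr president"},
    {"raw_text": "", "normalized_text": "this is the second example"},
    {"raw_text": "   ", "normalized_text": "whitespace only raw text"},
]

empty_rows = find_empty_text(sample, column="raw_text")
print(f"{len(empty_rows)} of {len(sample)} examples have an empty raw_text: {empty_rows}")
```

The helper only needs an iterable of dict-like rows, so it should also work on a split loaded (or streamed) with `load_dataset` from the `datasets` library, assuming the split exposes the same column names.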

Question: how do I check which text column ("raw_text" or "normalized_text") was used when creating the pseudo-labelled datasets on HF, such as https://huggingface.co/datasets/distil-whisper/voxpopuli?

guynich commented 3 months ago

Closing and moving the above information to https://github.com/huggingface/distil-whisper/issues/97.