huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Training README datasets table: text column and id column #97

Closed: guynich closed this issue 3 months ago

guynich commented 3 months ago

To pseudo-label the three open-source datasets, I had to reorder the entries in the Text Column and ID Column of the training README table; a quick check of the corrected columns is sketched below the table.

| Dataset                                                                                       | Languages | Domain                                | Speaking Style | License   | Text Column  | ID Column    |
|-----------------------------------------------------------------------------------------------|-----------|---------------------------------------|----------------|-----------|--------------|--------------|
| [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 6         | Audiobooks                            | Narrated       | CC-BY-4.0 | `"text"`     | `"id"`       |
| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)       | 108       | Wikipedia text & crowd-sourced speech | Narrated       | CC0-1.0   | `"sentence"` | `"path"`     |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)                               | 15        | European Parliament recordings        | Spontaneous    | CC0       | `"raw_text"` | `"audio_id"` |
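
A minimal sketch of how these column names can be checked with 🤗 Datasets (the config names and the `test` split are illustrative assumptions; Common Voice 13 is gated and requires accepting its terms on the Hub):

```python
from datasets import load_dataset

# Spot-check the Text Column / ID Column names in the table above.
# Config names ("german", "en") and the "test" split are illustrative assumptions.
mls = load_dataset("facebook/multilingual_librispeech", "german", split="test")
print(mls[0]["text"], mls[0]["id"])

cv = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test")
print(cv[0]["sentence"], cv[0]["path"])

vp = load_dataset("facebook/voxpopuli", "en", split="test")
print(vp[0]["raw_text"], vp[0]["audio_id"])
```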
guynich commented 3 months ago

For VoxPopuli I find `"raw_text"` to contain empty strings, and I think that is why my pseudo-labelling script failed after hours of compute with a `ValueError("one or more references are empty strings")`.

In the facebook/voxpopuli train split I find 5,463 empty `"raw_text"` strings out of the 182,482 examples. Each empty `"raw_text"` string has a corresponding non-empty `"normalized_text"` string.
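
A minimal sketch of this kind of count, assuming the English (`"en"`) config of the dataset:

```python
from datasets import load_dataset

# Count the empty "raw_text" strings in the VoxPopuli train split.
# The "en" config is an assumption; the split is audio-heavy, so expect a large download.
ds = load_dataset("facebook/voxpopuli", "en", split="train")

empty = ds.filter(lambda ex: ex["raw_text"] == "")
print(f"{len(empty)} empty 'raw_text' strings out of {len(ds)} examples")

# Each empty "raw_text" should still carry a usable normalised transcription.
assert all(ex["normalized_text"] != "" for ex in empty)
```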

guynich commented 3 months ago

Copied from #98: when pseudo-labelling the VoxPopuli dataset, the `"raw_text"` column (needed for the `--text_column_name` option) may be an empty string for some examples; see the HF dataset card here for an empty `"raw_text"` example.

Question: how do I check which text column (`"raw_text"` or `"normalized_text"`) was used when creating the pseudo-labelled datasets on HF, such as https://huggingface.co/datasets/distil-whisper/voxpopuli?

sanchit-gandhi commented 3 months ago

Hey @guynich - the provided transcriptions in the original VoxPopuli dataset are only used for computing the WER in the pseudo-labelling and distillation scripts. Since the WER is computed on normalised transcriptions, you can safely use the "normalized_text" column in the dataset, which is what was done for the Distil-Whisper datasets.
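
For intuition, a minimal sketch of WER computed on normalised text, using the Whisper normaliser from Transformers and the `wer` metric from Evaluate (the reference and prediction strings are made up for illustration):

```python
import evaluate
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

# The normaliser lowercases, strips punctuation, and standardises numbers,
# so differences like "27" vs "twenty seven" do not count as errors.
normalizer = EnglishTextNormalizer(english_spelling_mapping={})
wer_metric = evaluate.load("wer")

reference = "The Committee adopted the report, by 27 votes to 4."
prediction = "the committee adopted the report by twenty seven votes to four"

wer = wer_metric.compute(
    references=[normalizer(reference)],
    predictions=[normalizer(prediction)],
)
print(f"Normalised WER: {wer:.2f}")  # 0.00 here, despite the surface differences
```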

If you do decide to use the un-normalised (raw) text column, you should filter out any empty transcriptions from your dataset using the `raw_datasets.filter` method, e.g. as done here: https://github.com/huggingface/distil-whisper/blob/b948d0269c6f071708c55de4a1e4030cd7726f14/training/run_distillation.py#L1224-L1236
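
A minimal sketch of that kind of filter, assuming `raw_datasets` is a `DatasetDict` and `text_column_name` is set to `"raw_text"` (both names follow the linked script but are illustrative here):

```python
from datasets import load_dataset

text_column_name = "raw_text"  # illustrative; "normalized_text" sidesteps the issue
raw_datasets = load_dataset("facebook/voxpopuli", "en")  # the "en" config is an assumption

# Drop examples whose reference transcription is empty, so the downstream
# WER computation never sees an empty reference string.
raw_datasets = raw_datasets.filter(
    lambda text: len(text.strip()) > 0,
    input_columns=[text_column_name],
)
```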

guynich commented 3 months ago

Thank you for the helpful comment and for the fix in #102. Closing.