For VoxPopuli I find "raw_text" to contain empty strings, and I think that is why my pseudo-labelling script failed after hours of compute with a ValueError("one or more references are empty strings").
For the facebook/voxpopuli train split I find 5,463 empty "raw_text" strings out of the 182,482 examples. Each empty "raw_text" string has a corresponding non-empty "normalized_text" string.
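A minimal sketch of how I counted these with the datasets library (the "en" config name is an assumption here, based on the example count above):

```python
from datasets import load_dataset

# Load the VoxPopuli train split ("en" config is an assumption).
voxpopuli = load_dataset("facebook/voxpopuli", "en", split="train")

# Count examples whose "raw_text" is an empty string.
empty = voxpopuli.filter(lambda example: example["raw_text"].strip() == "")
print(f"{len(empty)} empty 'raw_text' strings out of {len(voxpopuli)} examples")

# Each empty "raw_text" should still have a non-empty "normalized_text".
assert all(text.strip() != "" for text in empty["normalized_text"])
```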
Copied from #98: When pseudo-labelling the VoxPopuli dataset, the "raw_text" (needed for option --text_column_name) may be an empty string for some examples - see the HF dataset card here for an empty "raw_text" example.
Question: how do I check which text column ("raw_text" or "normalized_text") was used when creating the pseudo-labelled datasets on HF, such as https://huggingface.co/datasets/distil-whisper/voxpopuli?
Hey @guynich - the provided transcriptions in the original VoxPopuli dataset are only used for computing the WER in the pseudo-labelling and distillation scripts. Since the WER is computed on normalised transcriptions, you can safely use the "normalized_text" column in the dataset, which is what was done for the Distil-Whisper datasets.
If you do decide to use the un-normalised (raw) text column, you should filter out any empty transcriptions from your dataset using a raw_datasets.filter method, e.g. as done here: https://github.com/huggingface/distil-whisper/blob/b948d0269c6f071708c55de4a1e4030cd7726f14/training/run_distillation.py#L1224-L1236
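A minimal sketch of that kind of filter, assuming the dataset has been loaded into a DatasetDict called raw_datasets (the linked script is the reference implementation):

```python
from datasets import DatasetDict, load_dataset

# Assumed setup: the VoxPopuli train split loaded into a DatasetDict
# ("en" config name is an assumption).
raw_datasets = DatasetDict(
    {"train": load_dataset("facebook/voxpopuli", "en", split="train")}
)

def is_text_non_empty(text):
    # Keep only examples whose reference transcription is a non-empty string.
    return text is not None and text.strip() != ""

raw_datasets["train"] = raw_datasets["train"].filter(
    is_text_non_empty,
    input_columns=["raw_text"],
    desc="filtering out empty raw_text transcriptions",
)
```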
Thank you for the helpful comment and for the fix #102. Closing.
To pseudo-label the three open-source datasets I had to re-order the table data under the "Text Column" and "ID Column" headings.