huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Unable to set concatenate_audio parameter to False in run_pseudo_labelling.py #133

Closed by lq0104 2 weeks ago

lq0104 commented 3 weeks ago

Setting the concatenate_audio parameter to False in run_pseudo_labelling.py appears to trigger a bug: the run fails with the following error.

06/03/2024 04:52:57 - INFO - __main__ - Traceback (most recent call last):
  File "/home/code/distil-whisper/training/run_pseudo_labelling.py", line 1040, in <module>
    main()
  File "/home/code/distil-whisper/training/run_pseudo_labelling.py", line 1023, in main
    eval_step_with_save(split=split)
  File "/home/code/distil-whisper/training/run_pseudo_labelling.py", line 1006, in eval_step_with_save
    raw_datasets[split] = raw_datasets[split].map(
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3066, in map
    raise ValueError(
ValueError: Input column ['condition_on_prev'] not in the dataset. Current columns in the dataset: ['id', 'path', 'audio', 'transcription', 'duration', 'language', 'original_speaker_id', 'session_id', 'topic', 'whisper_transcript', 'eval_preds']

Is there something I missed?
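
For anyone trying to narrow this down, the check that fails can be illustrated without loading any data. This is only a rough sketch in plain Python, assuming the error comes from map() comparing the requested input columns against the dataset's column names. The helper find_missing_columns and the requested_columns list are hypothetical and are not taken from run_pseudo_labelling.py or from the datasets library; only the column names come from the error message.

```python
def find_missing_columns(requested, available):
    """Return the requested column names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Column names copied from the ValueError message in the traceback.
dataset_columns = [
    "id", "path", "audio", "transcription", "duration", "language",
    "original_speaker_id", "session_id", "topic", "whisper_transcript",
    "eval_preds",
]

# Assumption for illustration: the script asks map() for these input columns.
# Only 'condition_on_prev' is known from the error; the other entry is a guess.
requested_columns = ["whisper_transcript", "condition_on_prev"]

missing = find_missing_columns(requested_columns, dataset_columns)
if missing:
    print(f"Columns requested but not present in the dataset: {missing}")
```

If the sketch matches what happens in the script, the condition_on_prev column is never created when concatenate_audio is False, yet it is still requested as an input column later on.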