huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Avoid preprocessing dataset again? #78

Closed. ghost closed this issue 4 months ago.

ghost commented 5 months ago

Hey there! While running distillation, the dataset is first preprocessed, which takes a long time on my machine (320k examples). I ran into an out-of-memory (OOM) error while training, so I restarted with a smaller batch size. On the second run, the whole dataset was preprocessed again from scratch.

Is there a way to avoid this? Is the preprocessed dataset cached? (I would think so, as it takes around 400 GB of storage.) The distillation script has a `preprocessing_only` flag that defaults to `False`, but it doesn't seem to address this.

Cheers!

sanchit-gandhi commented 4 months ago

The problem is that the batch size affects the pre-processing set-up: https://github.com/huggingface/distil-whisper/blob/bb6177e756ce897be49b8d6ecd8a034235785ee3/training/run_distillation.py#L1185-L1188
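Roughly what happens (a simplified sketch of the 🤗 Datasets caching behaviour, not the actual lines from `run_distillation.py`): `.map()` results are cached under a fingerprint computed from the transform function and its arguments, so if the training batch size is part of those arguments, changing it changes the fingerprint and forces the whole pre-processing pass to run again.

```python
# Simplified, hypothetical illustration of the datasets cache fingerprinting;
# not the distil-whisper code itself.
from datasets import load_dataset

raw = load_dataset("librispeech_asr", "clean", split="train.100")  # example dataset

def prepare(batch, per_device_train_batch_size):
    # feature extraction / tokenisation would happen here
    return batch

vectorized = raw.map(
    prepare,
    batched=True,
    fn_kwargs={"per_device_train_batch_size": 16},  # baked into the cache fingerprint
)
# Re-running with per_device_train_batch_size=8 produces a new fingerprint,
# so the ~400 GB of cached features cannot be reused.
```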

In retrospect, this was a poor design choice for exactly the reason you've outlined above: it would have been better to define a dedicated pre-processing batch size, kept independent of the training batch size. I'm fixing this in PR #81.
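A hypothetical sketch of that decoupling (illustrative names, not necessarily what PR #81 does): give the dataset `.map()` its own batch size, so that changing `per_device_train_batch_size` between runs no longer alters the `.map()` arguments and therefore no longer invalidates the cached features.

```python
# Hypothetical sketch only; argument names are assumptions, not the PR's code.
from dataclasses import dataclass, field

@dataclass
class DataTrainingArguments:
    preprocessing_batch_size: int = field(
        default=256,
        metadata={"help": "Batch size used only for dataset pre-processing."},
    )

# In the pre-processing step, the .map() call would depend only on this value:
#   vectorized = raw.map(prepare, batched=True,
#                        batch_size=data_args.preprocessing_batch_size)
# while the training dataloader keeps its own per_device_train_batch_size,
# which can now change between runs without re-running pre-processing.
```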

If you want to pre-process on-the-fly and avoid caching the large pre-processed dataset ahead of time, you can add the `--streaming` flag to your configuration. This loads the dataset as an iterable dataset and runs the pre-processing on each sample as it is loaded. Training starts much faster, but is slower overall, since the same audio file may be pre-processed multiple times (once per epoch).
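For reference, a minimal sketch of what the streaming path looks like in 🤗 Datasets (the dataset name here is just an example): the data is loaded as an `IterableDataset` and the `.map()` is applied lazily, so nothing is written to the cache and each sample is processed as it streams in.

```python
# Minimal sketch of streaming + lazy pre-processing; not the distillation
# script itself, just the underlying datasets behaviour.
from datasets import load_dataset

raw = load_dataset(
    "librispeech_asr", "clean", split="train.100", streaming=True
)  # example dataset

def prepare(example):
    # feature extraction / tokenisation would go here
    return example

vectorized = raw.map(prepare)  # lazy: applied per example as the data is read
```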