huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Avoid preprocessing dataset again? #78

Closed. ghost closed this issue 4 months ago.

ghost commented 5 months ago

Hey there! While running distillation, the dataset is first preprocessed, which takes a long time on my machine (320k examples). I ran into an out-of-memory (OOM) error while training, so I restarted with a smaller batch size. On the second run, the whole dataset was preprocessed again from scratch.

Is there a way to avoid this? Is the preprocessed dataset cached? (I would think so, as it takes around 400 GB of storage.) The distillation script has a `preprocessing_only` flag that defaults to `False`, but it doesn't seem to address this.

Cheers!

sanchit-gandhi commented 4 months ago

The problem is that the batch size affects the pre-processing set-up: https://github.com/huggingface/distil-whisper/blob/bb6177e756ce897be49b8d6ecd8a034235785ee3/training/run_distillation.py#L1185-L1188
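Roughly what happens (a simplified sketch of the 🤗 Datasets caching behaviour, not the actual lines from `run_distillation.py`): `.map()` results are cached under a fingerprint computed from the transform function and its arguments, so if the training batch size is part of those arguments, changing it changes the fingerprint and forces the whole pre-processing pass to run again.

```python
# Simplified, hypothetical illustration of the datasets cache fingerprinting;
# not the distil-whisper code itself.
from datasets import load_dataset

raw = load_dataset("librispeech_asr", "clean", split="train.100")  # example dataset

def prepare(batch, per_device_train_batch_size):
    # feature extraction / tokenisation would happen here
    return batch

vectorized = raw.map(
    prepare,
    batched=True,
    fn_kwargs={"per_device_train_batch_size": 16},  # baked into the cache fingerprint
)
# Re-running with per_device_train_batch_size=8 produces a new fingerprint,
# so the ~400 GB of cached features cannot be reused.
```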

In retrospect, this was a poor design choice for exactly the reason you've outlined above: it would have been better to define a dedicated pre-processing batch size, kept independent of the training batch size. I'm fixing this in PR #81.
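A hypothetical sketch of that decoupling (illustrative names, not necessarily what PR #81 does): give the dataset `.map()` its own batch size, so that changing `per_device_train_batch_size` between runs no longer alters the `.map()` arguments and therefore no longer invalidates the cached features.

```python
# Hypothetical sketch only; argument names are assumptions, not the PR's code.
from dataclasses import dataclass, field

@dataclass
class DataTrainingArguments:
    preprocessing_batch_size: int = field(
        default=256,
        metadata={"help": "Batch size used only for dataset pre-processing."},
    )

# In the pre-processing step, the .map() call would depend only on this value:
#   vectorized = raw.map(prepare, batched=True,
#                        batch_size=data_args.preprocessing_batch_size)
# while the training dataloader keeps its own per_device_train_batch_size,
# which can now change between runs without re-running pre-processing.
```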

If you want to pre-process on-the-fly and avoid caching the large pre-processed dataset ahead of time, you can add the `--streaming` flag to your configuration. This loads the dataset as an iterable dataset and runs the pre-processing on each sample as it is loaded. Training starts much faster, but is slower overall, since the same audio file may be pre-processed multiple times (once per epoch).
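For reference, a minimal sketch of what the streaming path looks like in 🤗 Datasets (the dataset name here is just an example): the data is loaded as an `IterableDataset` and the `.map()` is applied lazily, so nothing is written to the cache and each sample is processed as it streams in.

```python
# Minimal sketch of streaming + lazy pre-processing; not the distillation
# script itself, just the underlying datasets behaviour.
from datasets import load_dataset

raw = load_dataset(
    "librispeech_asr", "clean", split="train.100", streaming=True
)  # example dataset

def prepare(example):
    # feature extraction / tokenisation would go here
    return example

vectorized = raw.map(prepare)  # lazy: applied per example as the data is read
```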