ghost closed this 4 months ago
The problem is that the batch size affects the pre-processing set-up: https://github.com/huggingface/distil-whisper/blob/bb6177e756ce897be49b8d6ecd8a034235785ee3/training/run_distillation.py#L1185-L1188
In retrospect, this was a poor design choice for the exact reason you've outlined above: it would have been better to define a pre-processing batch size and have this independent of the training batch size. Fixing in this PR: #81.
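To illustrate why the coupling forces a full re-run, here is a toy sketch (not the actual 🤗 Datasets internals) of fingerprint-based caching: the cache key is derived from the `.map()` arguments, so any argument that embeds the training batch size invalidates the cache when that batch size changes, while a dedicated pre-processing batch size keeps the key stable. The `cache_fingerprint` helper and its parameter names are hypothetical.

```python
import hashlib
import json

def cache_fingerprint(params: dict) -> str:
    # Hypothetical stand-in for how a dataset cache derives its key:
    # hash the pre-processing arguments; any change means a cache miss
    # and the whole dataset is pre-processed again from scratch.
    payload = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Coupled design: the training batch size leaks into the map() arguments,
# so lowering it after an OOM changes the key and triggers a full re-run.
run_bs32 = cache_fingerprint({"fn": "prepare_dataset", "batch_size": 32})
run_bs16 = cache_fingerprint({"fn": "prepare_dataset", "batch_size": 16})
assert run_bs32 != run_bs16  # cache miss: everything is re-processed

# Decoupled design: a fixed preprocessing batch size keeps the key stable
# no matter what training batch size is used.
run_a = cache_fingerprint({"fn": "prepare_dataset", "preproc_batch_size": 256})
run_b = cache_fingerprint({"fn": "prepare_dataset", "preproc_batch_size": 256})
assert run_a == run_b  # cache hit: the 400 GB of features are reused
```

The design point is simply that nothing training-dependent should feed into the pre-processing cache key.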
If you want to pre-process on-the-fly, and avoid caching the large pre-processed dataset ahead of time, you can add the `--streaming` flag to your configuration. This will load the dataset as an iterable dataset and pre-process each sample as it is loaded. Training will start much faster, but run slower overall, since the same audio file may be pre-processed multiple times (once per epoch).
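As a toy sketch of that trade-off (pure Python, not the actual data loader): in streaming mode nothing is cached, so every pass over the data pays the pre-processing cost again. The `stream_with_preprocessing` helper below is hypothetical.

```python
def stream_with_preprocessing(raw_examples, preprocess):
    # Streaming mode: examples are pre-processed lazily, one at a time,
    # as the training loop consumes them -- nothing is written to disk.
    for example in raw_examples:
        yield preprocess(example)

call_count = 0

def preprocess(x):
    # Stand-in for feature extraction / tokenization of one audio sample.
    global call_count
    call_count += 1
    return x * 2

raw = [1, 2, 3]
num_epochs = 2
for _ in range(num_epochs):
    # Each epoch re-reads the stream from the start...
    list(stream_with_preprocessing(raw, preprocess))

# ...so every example is pre-processed once per epoch, not once total.
assert call_count == len(raw) * num_epochs
```

With a cached (non-streaming) setup, `call_count` would stay at `len(raw)` regardless of the number of epochs; that is the overall-throughput cost the streaming flag trades for a fast start.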
Hey there! While running distillation, the dataset is first pre-processed, which takes a long time on my machine (320k examples). I ran into an out-of-memory (OOM) error while training, so I started again with a smaller batch size. Running the process again, I found that it pre-processes the whole dataset from scratch.
Is there a way to avoid this? Is the pre-processed dataset cached? (I would think so, as it takes around 400 GB of storage.) The distillation script has a `preprocessing_only` flag that defaults to `False`, but it doesn't seem to address this. Cheers!