huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Cached English Common Voice dataset size. #94

Closed guynich closed 3 months ago

guynich commented 3 months ago

My HF datasets cache on disk contains the folder ./mozilla-foundation___common_voice_13_0, which du -h ./mozilla-foundation___common_voice_13_0 reports as 2.1T. I assume this is the unprocessed Common Voice dataset, i.e. 2.1 TB on disk.
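For anyone double-checking without du, the same measurement in Python; the cache path below is the default location and an assumption, since mine may differ if HF_DATASETS_CACHE is set:

```python
import os

def dir_size_bytes(path: str) -> int:
    """Recursively sum file sizes under `path`, roughly like `du -s`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

# Default datasets cache location (assumption); adjust if HF_DATASETS_CACHE is set.
cache_dir = os.path.expanduser(
    "~/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0"
)
print(f"{dir_size_bytes(cache_dir) / 1e12:.2f} TB")
```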

I created a pseudo-labelled dataset from "mozilla-foundation/common_voice_13_0" (which, I assume, is what created the cached folder above) by running the pseudo-labelling script with the *_config_name options set to "en", producing an English pseudo-labelled version of the dataset called common_voice_13_0_en_pseudo_labelled_large_v3_str.
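My understanding of the *_config_name options is that they select the Common Voice language config, roughly equivalent to the following (a sketch only, not the actual pseudo-labelling script):

```python
from datasets import load_dataset

# Illustrative only: "en" is the dataset config name, i.e. the language subset.
# The real pseudo-labelling script handles streaming, splits, and the
# pseudo-labelling itself.
common_voice_en = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "en",
    split="train",
    # token=True,  # Common Voice is gated: accept the terms on the Hub first
)
print(common_voice_en)
```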

I'm now running the distillation script from the training README Stage 3 here, and it is partway through generating the train/evaluation/test splits. The cache folder for my pseudo-labelled dataset has grown to 14 TB and is still growing. Luckily I have an instance with expandable storage.
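As a side note on how I've been keeping an eye on this: a loaded split exposes the Arrow files backing it, plus a documented cleanup hook. A sketch, using a tiny public dataset as a stand-in for one of my splits (the dataset name is illustrative only):

```python
from datasets import load_dataset

# Stand-in for one of my pseudo-labelled splits.
ds = load_dataset("rotten_tomatoes", split="train")

for f in ds.cache_files:
    print(f["filename"])  # Arrow files currently backing this split

# Deletes cache files this dataset object no longer references
# (I have not run it mid-preprocessing).
removed = ds.cleanup_cache_files()
print(f"removed {removed} unused cache files")
```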

Inspecting the cache folder with du -h ./common_voice_13_0_en_pseudo_labelled_large_v3_str/ I see multiple hash-named folders under default/0.0.0:

977G    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/86df6eb69614a3b8
81G     ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/d77299dbcd226395
221G    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/ec2c020908a23a69
324G    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/3c4d7a51735ffa53
648G    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/6728d7a8c8821ed2
1.3T    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/d4d5052f1224937b
2.6T    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/67bfb1d58dc91573
5.1T    ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/ee62aeed963be186
14T     ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0
14T     ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default
14T     ./common_voice_13_0_en_pseudo_labelled_large_v3_str/

Question: is this increase in size expected during the training script's preprocessing of the dataset?
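For my own understanding of the hash-named folders above: each one appears to correspond to a dataset fingerprint, and every cached .map() call writes a full new Arrow copy of the dataset. A toy sketch (nothing to do with the real preprocessing code) that shows the effect:

```python
import os
import tempfile

from datasets import Dataset

# Toy illustration: each cached `.map()` materialises a complete new Arrow
# copy of the dataset on disk, so successive preprocessing steps stack up.
tmp = tempfile.mkdtemp()
ds = Dataset.from_dict({"x": list(range(100_000))})

step1 = ds.map(
    lambda ex: {"y": ex["x"] * 2},
    cache_file_name=os.path.join(tmp, "step1.arrow"),
)
step2 = step1.map(
    lambda ex: {"z": ex["x"] + ex["y"]},
    cache_file_name=os.path.join(tmp, "step2.arrow"),
)

# Both intermediate copies remain on disk, each holding the full dataset.
for name in sorted(os.listdir(tmp)):
    if name.endswith(".arrow"):
        size_mb = os.path.getsize(os.path.join(tmp, name)) / 1e6
        print(f"{name}: {size_mb:.1f} MB")
```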

guynich commented 3 months ago

My mistake here - closing. I had wrongly assumed the HF datasets cache was where the pseudo-labelled dataset should be written. It is not.
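For anyone hitting the same confusion: what I should have done is keep the datasets cache (intermediate Arrow files) separate from where the pseudo-labelled dataset is actually written. A minimal sketch, with illustrative paths:

```python
import os

# Point the intermediate Arrow cache at scratch storage *before* importing
# datasets; HF_DATASETS_CACHE is read when the library is imported.
os.environ["HF_DATASETS_CACHE"] = "/mnt/scratch/hf_datasets_cache"

from datasets import load_dataset

# Gated dataset: requires having accepted the Common Voice terms on the Hub.
ds = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test")

# The final (pseudo-labelled) dataset goes to its own output location,
# not into the cache directory above.
ds.save_to_disk("/mnt/output/common_voice_13_0_en_pseudo_labelled_large_v3_str")
```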

Re-running with a pseudo-labelled dataset of 43 GB, the preprocessed cache folder comes to 718 GB, roughly 17x larger.
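For the record, a growth factor of that order seems plausible if the preprocessing stores padded log-mel features; a rough back-of-envelope, where every number is my assumption rather than a measurement:

```python
# Whisper pads every clip to a 30 s window and computes log-mel features at
# 100 frames/s, stored as float32 (80 mel bins assumed here; large-v3 uses
# more), so each example costs about the same regardless of its real length.
mel_bins, frames, bytes_per_float = 80, 3000, 4
feature_bytes = mel_bins * frames * bytes_per_float  # ~0.96 MB per example

# Assumed average Common Voice clip: ~5 s of ~48 kbit/s MP3.
avg_mp3_bytes = 5 * 48_000 // 8                      # ~30 kB per example

print(feature_bytes / avg_mp3_bytes)                 # ~32x, same ballpark as the 17x I saw
```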