Closed: guynich closed this issue 3 months ago
My mistake here - closing. I had wrongly assumed the HF datasets cache is the location to write the pseudo-labelled dataset. Not so.
Re-running with a pseudo-labelled dataset of 43 GB, the preprocessed cache folder is 718 GB, or 17x larger.
My HF cache on disk has this folder (found with the command `du -h ./mozilla-foundation___common_voice_13_0`):

```
2.1T ./mozilla-foundation___common_voice_13_0
```

I assume this is the unprocessed Common Voice dataset, and it is 2.1 TB. I created a pseudo-labelled dataset from `"mozilla-foundation/common_voice_13_0"` (i.e. I assume this is what created the cached folder above), using the pseudo-labelling script options `*_config_name` set to `"en"`, to create an English pseudo-labelled version of the dataset called `common_voice_13_0_en_pseudo_labelled_large_v3_str`.

I'm running the distillation script from the training README Stage 3 here, and it is currently generating the train/evaluation/test splits. The cached folder for my pseudo-labelled dataset has increased to 14 TB and is still growing. Luckily I have an instance with expandable storage.
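For context, here is a rough back-of-the-envelope sketch of why an audio preprocessing cache can balloon like this. The numbers below are assumptions, not measurements from this dataset: Common Voice ships compressed MP3 audio, and if the preprocessed Arrow cache stores decoded audio (or features) as raw float32, each second of audio takes far more space than its compressed form.

```python
# Sketch of why a preprocessed audio cache grows much larger than the source.
# All numbers are assumptions for illustration, not measured from this run:
MP3_BITRATE_BPS = 48_000          # ~48 kbps compressed MP3 (assumed)
SAMPLE_RATE_HZ = 16_000           # Whisper-style 16 kHz resampling (assumed)
BYTES_PER_SAMPLE = 4              # float32 samples in the Arrow cache (assumed)

compressed_bytes_per_sec = MP3_BITRATE_BPS / 8          # 6,000 B/s
raw_bytes_per_sec = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE   # 64,000 B/s

expansion = raw_bytes_per_sec / compressed_bytes_per_sec
print(f"decoded/compressed ratio: {expansion:.1f}x")    # prints 10.7x
```

Under these assumed numbers the decoded audio alone is roughly 10x the compressed size, before counting any intermediate `.map()` cache files, which would push the multiplier higher.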
Inspecting the cache folder with

```
du -h ./common_voice_13_0_en_pseudo_labelled_large_v3_str/
```

I see multiple `default` folders.

Question: is this increase in size expected when preprocessing the dataset for training?
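For anyone else inspecting this, a quick way to rank the per-dataset cache folders by size. The default cache path below is an assumption; override it with `HF_DATASETS_CACHE` if yours differs:

```shell
# Rank HF datasets cache folders by size, largest first.
# Assumes the default cache location unless HF_DATASETS_CACHE is set.
CACHE_DIR="${HF_DATASETS_CACHE:-$HOME/.cache/huggingface/datasets}"
du -sh "$CACHE_DIR"/*/ 2>/dev/null | sort -rh | head
```

This made it easy to spot which dataset folder was responsible for the growth.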