alex77g2 closed this issue 6 months ago
It looks like it's copying the entire dataset into the subfolders. I'll try to reproduce and fix this.
Merged https://github.com/mlcommons/algorithmic-efficiency/pull/692 into dev.
Also added a --skip_download flag.
If you remove the imagenet/pytorch/train and imagenet/pytorch/val directories, I think you should be able to resume the setup with your original command plus the --skip_download flag:
python3 datasets/dataset_setup.py --data_dir $DATA_DIR --imagenet --temp_dir $DATA_DIR/tmp --imagenet_train_url=<redacted> --imagenet_val_url=<redacted> --framework pytorch --skip_download True
I'll merge this fix into main shortly, but feel free to check out dev and try it out.
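The resume flow described above could be guarded roughly like this. This is a minimal sketch only: `download`, `extract`, and `setup_imagenet` are hypothetical stand-ins, not the actual functions in dataset_setup.py.

```python
import os

# Placeholder stubs standing in for the real download/extract helpers.
def download(url, dest):
    return f"downloaded {url} -> {dest}"

def extract(src, dest):
    return f"extracted {src} -> {dest}"

def setup_imagenet(data_dir, train_url, val_url, skip_download=False):
    tmp_dir = os.path.join(data_dir, "tmp")
    steps = []
    if not skip_download:
        # Fetch the tarballs only when the flag is unset.
        steps.append(download(train_url, tmp_dir))
        steps.append(download(val_url, tmp_dir))
    # Extraction always runs, resuming from archives already in tmp_dir.
    steps.append(extract(tmp_dir, os.path.join(data_dir, "imagenet", "pytorch")))
    return steps

# With skip_download=True, only the extraction step runs.
print(setup_imagenet("/data", "train_url", "val_url", skip_download=True))
```

With the flag set, the 151 GB download from the earlier run is reused instead of being fetched again.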
Thanks, it (the dev branch) works much better now.
dataset_setup.py --imagenet --framework pytorch
needs unusually huge disk space (> 1000 GB after downloading), although GETTING_STARTED.md says "2 TB in storage (for datasets)" is enough.

Description
For comparison: tiny-imagenet-200.zip (about 200 MB, train = 200x500 JPGs) extracts to roughly the same size as the archive, since JPGs cannot be compressed further by zip. The ImageNet-1000 train set (1000 classes x ~1300 JPGs, 141 GB download, a tar file of tar files) should likewise extract to roughly its download size. But we run out of disk space before even 10% of the files are extracted!
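The expectation above can be put as a back-of-the-envelope check. The 5% tar-header overhead below is my own rough assumption, not a measured value; the 141 GB figure is from the report.

```python
# Back-of-the-envelope: JPEGs are already compressed, so a tar-of-tars
# of JPEGs should extract to roughly its own size.
download_gb = 141                        # ImageNet train download (reported)
expected_gb = round(download_gb * 1.05)  # assumed ~5% tar-header overhead
print(expected_gb)  # 148 -- nowhere near the >1000 GB actually consumed
```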
Steps to Reproduce
python3 datasets/dataset_setup.py --data_dir /extra2tb/data --imagenet --temp_dir /extra2tb/data/tmp --framework pytorch --imagenet_train_url https://image-net.org/data/ILSVRC/2012/HIDDEN_train.tar --imagenet_val_url https://image-net.org/data/ILSVRC/2012/HIDDEN_img_val.tar
# filenames hidden here (as they need a login)

Download is OK (about 151 GB); still 1 TB free on disk at this point.

I0307 16:08:02.928054 134648926629952 dataset_setup.py:556] Extracting imagenet train data
(Everything is OK until here.) The script extracts 1000 n*.tar files (144 GB together) into /extra2tb/data/imagenet/pytorch/train (still fine). Now it starts to get strange: soon afterwards the disk is full, and hundreds of huge identical files are the cause. Different nested folders appear, each containing the same 1000 tar files! All of these folders contain the same list of file names (and files with matching names also match in content; I checked a few). Each of these tar files contains 1300 JPEG files (a different tar name means different JPEGs inside). One example (randomly picked). The disk fills up with identical files, well over 1000 GB, although GETTING_STARTED.md says "2 TB in storage (for datasets)" is enough. This seems to be a bug in dataset_setup.py.
This is only to show the repeating-filename issue...
Source or Possible Fix
The behaviour above is obviously not intended.
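One plausible failure mode consistent with the symptoms (I have not confirmed this is what dataset_setup.py actually does): iterating over the train directory with something like os.walk *while* extracting new files into it can pick up entries written during the loop and process them again, duplicating data. Snapshotting the archive list before extracting avoids that. The demo below is a self-contained sketch with a tiny fake tarball, not the real setup code.

```python
import io
import os
import tarfile
import tempfile

def extract_all_fixed(train_dir):
    # Snapshot the tar list BEFORE extracting, so files written during
    # the loop are never discovered and extracted a second time.
    tars = [f for f in sorted(os.listdir(train_dir)) if f.endswith(".tar")]
    for name in tars:
        out = os.path.join(train_dir, name[:-4])
        os.makedirs(out, exist_ok=True)
        with tarfile.open(os.path.join(train_dir, name)) as t:
            t.extractall(out)

# Tiny demo: build one fake per-class tarball, then extract it once.
with tempfile.TemporaryDirectory() as d:
    with tarfile.open(os.path.join(d, "n01440764.tar"), "w") as t:
        data = b"\xff\xd8fakejpeg"  # placeholder bytes, not a real JPEG
        info = tarfile.TarInfo("n01440764_1.JPEG")
        info.size = len(data)
        t.addfile(info, io.BytesIO(data))
    extract_all_fixed(d)
    found = os.path.exists(os.path.join(d, "n01440764", "n01440764_1.JPEG"))
    print(found)  # True
```

A lazy directory walk, by contrast, re-reads directory contents as it goes, which is exactly the kind of behaviour that could keep finding "new" tar files in freshly extracted subfolders.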