mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

dataset_setup.py --imagenet --framework pytorch # impossible to extract ILSVRC2012_img_train.tar (hundreds of duplicate tar-files) #688

Closed alex77g2 closed 6 months ago

alex77g2 commented 6 months ago

dataset_setup.py --imagenet --framework pytorch needs an unusually huge amount of disk space (> 1000 GB after downloading), even though GETTING_STARTED.md says "2 TB in storage (for datasets)" is sufficient.

Description

For comparison: tiny-imagenet-200.zip (about 200 MB, train = 200x500 JPEGs) extracts to files of roughly the same total size, since JPEGs cannot be compressed further by zip. ImageNet-1000 (train = 1000 classes of JPEGs, 141 GB download, a tar file of tar files) should likewise extract to roughly its download size. But we run out of disk space long before 10% of the files are extracted!

Steps to Reproduce

python3 datasets/dataset_setup.py --data_dir /extra2tb/data --imagenet --temp_dir /extra2tb/data/tmp --framework pytorch --imagenet_train_url https://image-net.org/data/ILSVRC/2012/HIDDEN_train.tar --imagenet_val_url https://image-net.org/data/ILSVRC/2012/HIDDEN_img_val.tar

(Filenames hidden here, as they require a login.) The download is fine (about 151 GB), and about 1 TB is still free on disk at this point.

I0307 16:08:02.928054 134648926629952 dataset_setup.py:556] Extracting imagenet train data

Everything is OK up to here. The script extracts 1000 n*.tar files (144 GB together) into /extra2tb/data/imagenet/pytorch/train, which is still fine. Then it starts to get strange: soon afterwards the disk is full, caused by hundreds of huge identical files. There are different folders, each containing the same 1000 tar files! All of these folders contain the same list of file names, and files with matching names also match in content (checked for some of them). Each of these tar files contains 1300 JPEG files (a different tar name means different JPEG files inside). One randomly picked example:

/extra2tb/data/imagenet/pytorch/train$ cmp n03692522/n01440764.tar n03602883/n01440764.tar # same content other folder
/extra2tb/data/imagenet/pytorch/train$ cksum n03692522/n01440764.tar n03602883/n01440764.tar
2049352163 157368320 n03692522/n01440764.tar
2049352163 157368320 n03602883/n01440764.tar
/extra2tb/data/imagenet/pytorch/train$ du -h
138G    ./n04442312
138G    ./n04239074
138G    ./n04356056
138G    ./n04228054
etc. etc. (With 1000 of these folders, extracting this dataset would need 138 GB x 1000 = 138 TB of disk space, which is enormous!)

The disk fills up with identical files, far beyond 1000 GB, even though GETTING_STARTED.md says "2 TB in storage (for datasets)" is sufficient. This seems to be a bug in dataset_setup.py.

Just to show the repeating file names:

/extra2tb/data/imagenet/pytorch/train$ ls n03692522/*.tar | head
n03692522/n01440764.tar
n03692522/n01443537.tar
n03692522/n01484850.tar
n03692522/n01491361.tar
n03692522/n01494475.tar
n03692522/n01496331.tar
n03692522/n01498041.tar
n03692522/n01514668.tar
n03692522/n01514859.tar
n03692522/n01518878.tar
/extra2tb/data/imagenet/pytorch/train$ ls n03602883/*.tar | head
n03602883/n01440764.tar
n03602883/n01443537.tar
n03602883/n01484850.tar
n03602883/n01491361.tar
n03602883/n01494475.tar
n03602883/n01496331.tar
n03602883/n01498041.tar
n03602883/n01514668.tar
n03602883/n01514859.tar
n03602883/n01518878.tar
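The per-pair cmp/cksum checks above can be generalized with a short script that hashes same-named tar files across all class folders and reports which names occur in several folders with identical content. This is a hypothetical helper (not part of dataset_setup.py), shown here on synthetic data since the real 138 GB folders are impractical to reproduce:

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """SHA-256 of a file, read in chunks (analogue of the cksum check above)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicated_names(train_dir):
    """Return tar names that appear in more than one class folder under
    train_dir with byte-identical content."""
    occurrences = {}  # tar name -> list of digests across class folders
    for folder in sorted(os.listdir(train_dir)):
        fdir = os.path.join(train_dir, folder)
        if not os.path.isdir(fdir):
            continue
        for name in os.listdir(fdir):
            if name.endswith(".tar"):
                occurrences.setdefault(name, []).append(
                    file_digest(os.path.join(fdir, name)))
    return sorted(name for name, digests in occurrences.items()
                  if len(digests) > 1 and len(set(digests)) == 1)

if __name__ == "__main__":
    # Synthetic demo: two class folders holding the same tar file.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in ("n03692522", "n03602883"):
            os.makedirs(os.path.join(tmp, folder))
            with open(os.path.join(tmp, folder, "n01440764.tar"), "wb") as f:
                f.write(b"identical tar bytes")
        print(duplicated_names(tmp))  # -> ['n01440764.tar']
```

On the broken layout described above, every one of the 1000 tar names would be reported.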

Source or Possible Fix

The behaviour above is obviously not intended.
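For reference, the intended result is that each inner class tar from ILSVRC2012_img_train.tar is unpacked exactly once, into its own class directory, and then deleted. The following is a minimal sketch of that two-level extraction (not the actual dataset_setup.py code), demonstrated on a tiny synthetic tar-of-tars:

```python
import os
import tarfile
import tempfile

def extract_imagenet_train(train_tar_path, out_dir):
    """Unpack a tar-of-tars: extract the outer archive, then unpack each
    inner n*.tar once into its own class directory and remove the tar."""
    os.makedirs(out_dir, exist_ok=True)
    # Step 1: extract the per-class tar files from the outer archive.
    with tarfile.open(train_tar_path) as outer:
        outer.extractall(out_dir)
    # Step 2: unpack each class tar into a directory named after the class,
    # then delete the inner tar so the data is stored only once.
    for name in sorted(os.listdir(out_dir)):
        if not name.endswith(".tar"):
            continue
        class_dir = os.path.join(out_dir, name[:-4])  # e.g. n01440764
        os.makedirs(class_dir, exist_ok=True)
        inner_path = os.path.join(out_dir, name)
        with tarfile.open(inner_path) as inner:
            inner.extractall(class_dir)
        os.remove(inner_path)

if __name__ == "__main__":
    # Build a tiny synthetic tar-of-tars with two fake classes.
    with tempfile.TemporaryDirectory() as tmp:
        inner_tars = []
        for cls in ("n01440764", "n01443537"):
            img = os.path.join(tmp, f"{cls}_0001.JPEG")
            with open(img, "wb") as f:
                f.write(b"fake jpeg bytes")
            inner = os.path.join(tmp, f"{cls}.tar")
            with tarfile.open(inner, "w") as t:
                t.add(img, arcname=f"{cls}_0001.JPEG")
            inner_tars.append(inner)
        outer = os.path.join(tmp, "ILSVRC2012_img_train.tar")
        with tarfile.open(outer, "w") as t:
            for inner in inner_tars:
                t.add(inner, arcname=os.path.basename(inner))
        train_dir = os.path.join(tmp, "train")
        extract_imagenet_train(outer, train_dir)
        print(sorted(os.listdir(train_dir)))  # -> ['n01440764', 'n01443537']
```

With this layout, the train directory holds exactly one copy of each class, so total disk usage stays close to the 141 GB download size rather than multiplying per folder.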

priyakasimbeg commented 6 months ago

It looks like it's copying the entire dataset in the subfolders. I'll try to reproduce and fix this.

priyakasimbeg commented 6 months ago

Merged https://github.com/mlcommons/algorithmic-efficiency/pull/692 into dev. I also added a --skip_download flag. If you remove the imagenet/pytorch/train and imagenet/pytorch/val directories, I think you should be able to just resume the setup with your original command and the --skip_download flag:

python3 datasets/dataset_setup.py --data_dir $DATA_DIR --imagenet --temp_dir $DATA_DIR/tmp --imagenet_train_url=<redacted> --imagenet_val_url=<redacted> --framework pytorch --skip_download True

I'll merge this fix into main shortly but feel free to checkout dev and try it out.

alex77g2 commented 6 months ago

Thanks, it (the dev branch) works much better now.