isi-vista / unified-io-inference

Apache License 2.0

Pull down CC12M data set images. #6

Closed danielnapierski closed 1 year ago

danielnapierski commented 1 year ago

I found two scripts for pulling down the CC12M data set.

I'm not sure if there is an authoritative approach. I intend to use the first approach and will link to the results when done.

I am currently building a new Python environment for the first script. When I ran it under Python 3.10, it failed with an out-of-memory (OOM) exception.

danielnapierski commented 1 year ago

CC12M download is in progress on sagalg13.
Count so far: 160,000 images downloaded (11 GB).

$ du -h -d 1 /nas/gaia02/data/paper2023/
16G /nas/gaia02/data/paper2023/vg
18G /nas/gaia02/data/paper2023/vizwiz
11G /nas/gaia02/data/paper2023/cc12m
341G    /nas/gaia02/data/paper2023/imagenet
384G    /nas/gaia02/data/paper2023/

Process:

srun -p gaia-lg -A gaia-lg --mem 0 --gpus 4 --pty /bin/bash
...
cd /nas/gaia02/data/paper2023/cc12m
conda activate cc12m-3.9
img2dataset --url_list cc12m.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder images --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb False
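
For anyone who prefers to drive the same download from a script, here is a minimal sketch that mirrors the CLI flags above as keyword arguments to img2dataset's Python entry point. The helper names `cc12m_download_kwargs` and `run_download` are hypothetical, and I am assuming the installed img2dataset version exposes `download` with these parameter names; check against the version in your environment before relying on it.

```python
def cc12m_download_kwargs(url_list="cc12m.tsv", output_folder="images"):
    """Return keyword arguments matching the CLI flags used in this thread."""
    return dict(
        url_list=url_list,
        input_format="tsv",
        url_col="url",
        caption_col="caption",
        output_format="webdataset",
        output_folder=output_folder,
        processes_count=16,
        thread_count=64,
        image_size=256,
        enable_wandb=False,
    )


def run_download():
    """Start the download (hypothetical wrapper, not part of img2dataset)."""
    # Imported lazily so this file can be loaded without img2dataset installed.
    # Assumption: the installed version provides `download` with these kwargs.
    from img2dataset import download

    download(**cc12m_download_kwargs())
```

Calling `run_download()` from the directory containing `cc12m.tsv` should behave like the command above. Nothing runs on import, so the settings can be inspected first.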

danielnapierski commented 1 year ago

Over 3 million images have been downloaded in CC12M. There are 12 million total, so we're only about 1/4 through.

$ du -h -d 1 /nas/gaia02/data/paper2023/
24G /nas/gaia02/data/paper2023/vg
18G /nas/gaia02/data/paper2023/vizwiz
82G /nas/gaia02/data/paper2023/cc12m
341G    /nas/gaia02/data/paper2023/imagenet
465G    /nas/gaia02/data/paper2023/
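
As a rough sanity check on disk space, the numbers above (about 3 million of 12 million images, 82G on disk) can be extrapolated linearly. This is only a sketch: the function name `projected_total_gb` is hypothetical, and it assumes the remaining images have the same average size and that every URL downloads successfully, neither of which is guaranteed.

```python
def projected_total_gb(done_images, total_images, size_so_far_gb):
    """Linearly extrapolate the final on-disk size from progress so far."""
    if done_images <= 0:
        raise ValueError("done_images must be positive")
    return size_so_far_gb * total_images / done_images


# Figures taken from this comment: ~3M of 12M images, 82G in cc12m/.
fraction_done = 3_000_000 / 12_000_000
estimate_gb = projected_total_gb(3_000_000, 12_000_000, 82)
print(f"{fraction_done:.0%} done, projected total ~{estimate_gb:.0f}G")
```

Because "over 3 million" is a lower bound and some URLs may fail, the real total could come in under this estimate.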

danielnapierski commented 1 year ago

Done. See /nas/gaia02/data/paper2023/cc12m/README.md.

$ du -h -d 1 /nas/gaia02/data/paper2023/
16G /nas/gaia02/data/paper2023/vg
18G /nas/gaia02/data/paper2023/vizwiz
281G    /nas/gaia02/data/paper2023/cc12m
341G    /nas/gaia02/data/paper2023/imagenet
96K /nas/gaia02/data/paper2023/results
655G    /nas/gaia02/data/paper2023/
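
To check how many images actually landed, one option is to sum the per-shard statistics in the output folder. The sketch below assumes img2dataset wrote one `*_stats.json` file per shard with a `successes` field. That layout is an assumption on my part and may differ between versions, and the helper name `count_successes` is hypothetical.

```python
import json
from pathlib import Path


def count_successes(output_folder):
    """Sum the 'successes' field across all per-shard stats files."""
    total = 0
    # Assumption: stats files are named like 00000_stats.json in the folder.
    for stats_path in sorted(Path(output_folder).glob("*_stats.json")):
        with stats_path.open() as f:
            stats = json.load(f)
        # Shards without the field contribute zero rather than raising.
        total += int(stats.get("successes", 0))
    return total
```

For the run in this thread, the call would be `count_successes("/nas/gaia02/data/paper2023/cc12m/images")`.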