danielnapierski closed 1 year ago
cc12m
Download in progress on sagalg13.
Count: 160000 images downloaded (11 GB).
$ du -h -d 1 /nas/gaia02/data/paper2023/
16G /nas/gaia02/data/paper2023/vg
18G /nas/gaia02/data/paper2023/vizwiz
11G /nas/gaia02/data/paper2023/cc12m
341G /nas/gaia02/data/paper2023/imagenet
384G /nas/gaia02/data/paper2023/
Process:
srun -p gaia-lg -A gaia-lg --mem 0 --gpus 4 --pty /bin/bash
...
cd /nas/gaia02/data/paper2023/cc12m
conda activate cc12m-3.9
img2dataset --url_list cc12m.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder images --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb False
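The command above reads the `url` and `caption` columns by name, so `cc12m.tsv` needs a header row with those two names. A minimal sketch of a guard for that, assuming the TSV may have been downloaded without a header; `ensure_header` is a hypothetical helper name, not part of img2dataset:

```shell
# Sketch: make sure a TSV starts with the "url<TAB>caption" header that
# --url_col "url" and --caption_col "caption" refer to.
# ensure_header is a hypothetical helper, not an img2dataset command.
ensure_header() {
  local tsv="$1"
  local header
  header="$(printf 'url\tcaption')"
  # Only rewrite the file when the first line is not already the header.
  if [ "$(head -n 1 "$tsv")" != "$header" ]; then
    local tmp
    # Create the temporary file next to the TSV so the final mv stays
    # on the same filesystem.
    tmp="$(mktemp "${tsv}.XXXXXX")"
    { printf '%s\n' "$header"; cat "$tsv"; } > "$tmp"
    mv "$tmp" "$tsv"
  fi
}
```

Running `ensure_header cc12m.tsv` before `img2dataset` would be one way to use it. The call is idempotent: a second run leaves the file unchanged. Note that `mktemp` creates the replacement file with owner-only permissions.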
Over 3 million images have been downloaded in CC12M. There are 12 million total, so we're only about 1/4 through.
$ du -h -d 1 /nas/gaia02/data/paper2023/
24G /nas/gaia02/data/paper2023/vg
18G /nas/gaia02/data/paper2023/vizwiz
82G /nas/gaia02/data/paper2023/cc12m
341G /nas/gaia02/data/paper2023/imagenet
465G /nas/gaia02/data/paper2023/
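The progress figure and the disk usage above can be cross-checked with a few lines of shell. This is a rough sketch under stated assumptions: the shards are `.tar` files in the `images` output folder, each downloaded image is stored as a `.jpg` member, and disk usage grows linearly with the number of images. The function names are hypothetical.

```shell
# count_images DIR: count .jpg members across all .tar shards in DIR.
# Assumes one .jpg entry per successfully downloaded image.
count_images() {
  local dir="$1" total=0 n shard
  for shard in "$dir"/*.tar; do
    # Skip the unexpanded glob when DIR has no shards yet.
    [ -e "$shard" ] || continue
    # grep -c exits non-zero on zero matches, so guard it with || true.
    n="$(tar -tf "$shard" | grep -c '\.jpg$' || true)"
    total=$((total + n))
  done
  printf '%s\n' "$total"
}

# percent_done DONE TOTAL: integer percentage of images downloaded.
percent_done() {
  printf '%s\n' $(( $1 * 100 / $2 ))
}

# project_size CURRENT_GB DONE TOTAL: linear projection of the final size in GB.
project_size() {
  printf '%s\n' $(( $1 * $3 / $2 ))
}
```

For example, `percent_done 3000000 12000000` prints `25`, and `project_size 82 3000000 12000000` prints `328`, so the 82G seen here would extrapolate to roughly 328G if every URL downloaded successfully. Listing large shards over the NAS can be slow, so `count_images images` is best run sparingly.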
Done. See /nas/gaia02/data/paper2023/cc12m/README.md
$ du -h -d 1 /nas/gaia02/data/paper2023/
16G /nas/gaia02/data/paper2023/vg
18G /nas/gaia02/data/paper2023/vizwiz
281G /nas/gaia02/data/paper2023/cc12m
341G /nas/gaia02/data/paper2023/imagenet
96K /nas/gaia02/data/paper2023/results
655G /nas/gaia02/data/paper2023/
I found two scripts for pulling down the CC12M dataset.
I'm not sure if there is an authoritative approach. I intend to use the first approach and will link to the results when done.
I am currently building a new Python env for the first script. When I ran it under Python 3.10, I got an out-of-memory (OOM) exception.