rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

downloading large dataset with limited disk space #184

Open yao-eastside opened 2 years ago

yao-eastside commented 2 years ago

Hi @rom1504, for dataset like the Laion-400m or 5B, it is very hard to download them to a single disk when the disk space is limited. What I did was to run img2dataset against each parquet file (which contains the url info) and img2dataset will generate a directory for each parquet file, and each dir contains about 1294 tar files (400m case). And I can move around the data.

However, if I run img2dataset on a directory that holds a list of parquet files (for example, LAION-400M has 32 parquet files), it seems that img2dataset combines all 32 parquet files and generates about 1294*32 tar files. That is not what I want: I want to be able to download any subset of LAION and still combine the pieces later.

My question: how can I ask img2dataset to create a separate directory, and download the tar files into it, for each source parquet file of URLs, instead of combining them all together?

When each directory has 1294 tar files, it is easy to specify them with the brace-expansion syntax abc{00..99}/{00000..01293}.tar. But if I run img2dataset against a few parquet files, say 2, I get about 1294*2 tar files in one directory, and I don't have a good way to feed them to trainers, because different directories end up with different numbers of files. -Steve
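
One way around the uneven shard counts, assuming the shards are consumed with the webdataset library, is to glob the tar files across all the per-parquet directories and pass the explicit list instead of a brace pattern (a minimal sketch; the directory layout is a placeholder):

```python
import glob

import webdataset as wds

# Hypothetical layout: one directory per source parquet, each holding its own tars.
shard_paths = sorted(glob.glob("laion400m-data/*/*.tar"))

# WebDataset accepts an explicit list of shard paths, so it does not matter
# that different directories contain different numbers of tar files.
dataset = wds.WebDataset(shard_paths).decode("pil").to_tuple("jpg", "txt")
```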

mudassirkhan19 commented 1 year ago

@yao-eastside did you get a solution to this?