Hi @rom1504, for datasets like Laion-400m or 5B, it is very hard to download everything to a single disk when disk space is limited. What I did was run img2dataset against each parquet file (which contains the URL info) separately, so that img2dataset generates one directory per parquet file, each containing about 1294 tar files (in the 400m case). That way I can move the data around.
However, if I run img2dataset on a dir containing a list of parquet files (for example, Laion-400m has 32 parquet files), it seems that img2dataset combines all 32 parquet files and generates about 1294*32 tar files. That is not what I want: I want to be able to download any subset of Laion and still combine the subsets later.
My question: how can I ask img2dataset to create a separate dir for each source parquet file (URLs) and download that file's tar files into it, rather than combining everything together?
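To make the per-file workaround concrete, here is a minimal sketch of driving one download per parquet file, each into its own dir. The helper names `plan_per_parquet_jobs` and `run_jobs` are hypothetical, and `download_fn` stands in for whatever performs the download; the keyword arguments assume the `url_list`, `output_folder`, `input_format`, and `output_format` parameters of `img2dataset.download`.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_per_parquet_jobs(parquet_dir: str, output_root: str) -> List[Tuple[str, str]]:
    """Pair every *.parquet file with an output dir named after the file."""
    files = sorted(Path(parquet_dir).glob("*.parquet"))
    return [(str(f), str(Path(output_root) / f.stem)) for f in files]


def run_jobs(jobs: List[Tuple[str, str]], download_fn: Callable[..., None]) -> None:
    """Run one download per parquet file so the shards are never merged."""
    for url_list, output_folder in jobs:
        Path(output_folder).mkdir(parents=True, exist_ok=True)
        # Assumption: download_fn accepts these keyword arguments,
        # as img2dataset.download is expected to.
        download_fn(
            url_list=url_list,
            output_folder=output_folder,
            input_format="parquet",
            output_format="webdataset",
        )
```

With img2dataset installed, this would presumably be called as `run_jobs(plan_per_parquet_jobs("parquets", "out"), img2dataset.download)`, with column names and other options added as needed. Passing only a slice of the job list downloads a subset.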
When each dir has 1294 tar files, it is easy to specify them using the syntax abc{00..99}/{00000..01293}.tar. But if I run img2dataset against a few parquet files, say 2, I get about 1294*2 tar files in one dir, and I don't have a good way to feed them to trainers, since different dirs then have different numbers of files.
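When dirs hold different numbers of tar files, a single brace pattern no longer covers them, but an explicit shard list can. The sketch below assumes the layout described above (one subdir per download under a common root, tar files directly inside each); `collect_shards` is a hypothetical helper, not part of img2dataset.

```python
from pathlib import Path
from typing import List


def collect_shards(output_root: str) -> List[str]:
    """List every tar shard under output_root, one subdir at a time."""
    root = Path(output_root)
    shards: List[str] = []
    # Sort subdirs and files so the resulting order is deterministic.
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        shards.extend(str(t) for t in sorted(subdir.glob("*.tar")))
    return shards
```

The resulting list can be handed to any trainer or loader that accepts an explicit list of shard paths instead of a brace pattern, so the per-dir file counts do not need to match.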
-Steve