rom1504 / laion-prepro

Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them

How to download the newest version of dataset without duplicate files? #15

Closed qiaogh97 closed 2 years ago

qiaogh97 commented 2 years ago
Hi, @rom1504 I know there are three versions of the parquet files as below.

| Version | Parquet file size | Hash value | Total size |
|---------|-------------------|------------|-------------|
| 1.0 | 1.6G | 5b54c5d5 | 400 million |
| 2.0 | 3.6G | 03f11a48 | 800 million |
| 3.0 | 4.9G | f27692e1 | 1.1 billion |

So I wonder whether the parquet files in the different versions correspond one-to-one. I have downloaded the 400 million version of the dataset. What should I do if I'd like to download the newest version of the dataset without re-downloading the files I already have?

rom1504 commented 2 years ago

Hi, all three versions you mention are free of duplicates and are subsets of each other, i.e. version 2 contains version 1, and version 3 contains version 2.
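Since each version contains the previous one, the samples still missing locally are the rows of the newer metadata whose key does not appear in the older metadata. Here is a minimal sketch of that set difference. It assumes each metadata row is available as a dict with a `URL` field that identifies the sample; the field name, the helper `new_rows`, and the toy rows are illustrative assumptions, not part of this repository.

```python
from typing import Dict, Iterable, Iterator, Set

Row = Dict[str, str]


def new_rows(old_rows: Iterable[Row],
             newer_rows: Iterable[Row],
             key: str = "URL") -> Iterator[Row]:
    """Yield rows of the newer version whose key is absent from the older one.

    `key` is an assumed column name; adjust it to whatever uniquely
    identifies a sample in your metadata.
    """
    seen: Set[str] = {row[key] for row in old_rows}
    for row in newer_rows:
        if row[key] not in seen:
            # Remember the key so a repeat inside newer_rows is skipped too.
            seen.add(row[key])
            yield row


# Toy metadata standing in for two versions (hypothetical values).
v1 = [
    {"URL": "http://example.com/a.jpg", "TEXT": "a"},
    {"URL": "http://example.com/b.jpg", "TEXT": "b"},
]
v2 = v1 + [
    {"URL": "http://example.com/c.jpg", "TEXT": "c"},
    {"URL": "http://example.com/d.jpg", "TEXT": "d"},
]

to_download = list(new_rows(v1, v2))
print([row["URL"] for row in to_download])
```

Holding every key of the older version in an in-memory set is fine for a sketch, but at hundreds of millions of rows it may need a lot of RAM, so a hash-partitioned or on-disk comparison could be a better fit there.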

Only the 400M version (the first one) is properly released by us (that's the one we call laion400m) and you can get it from https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ or https://www.kaggle.com/romainbeaumont/laion400m

The other two versions you mention are work in progress and are not yet fully ready for use. For example, unlike version 1, versions 2 and 3 are not fully randomly shuffled, which is an important property for use of the dataset.

We will release a larger version of the dataset with a few billion samples in a few months.

Do you have any deadlines / uses of the larger dataset (larger than 400m) on your side?

qiaogh97 commented 2 years ago

It doesn't matter, I don't have any deadlines.