Closed (qiaogh97 closed 3 years ago)
Hi, all three versions you mention are free of duplicates, and each one is a subset of the next, i.e. version 2 contains version 1, and version 3 contains version 2.
Only the 400M version (the first one) is properly released by us (that's the one we call laion400m) and you can get it from https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ or https://www.kaggle.com/romainbeaumont/laion400m
The other two versions you mention are work in progress and are not yet fully ready for use. For example, unlike version 1, versions 2 and 3 are not fully randomly shuffled, and random shuffling is an important property when using the dataset.
We will release a larger version of the dataset, with a few billion samples, in a few months.
Do you have any deadlines or planned uses for the larger dataset (larger than 400M) on your side?
It doesn't matter, I don't have any deadlines.
I'd like to know whether the parquet files in the different versions correspond one-to-one. I have already downloaded the 400 million version of the dataset. What should I do if I'd like to download the newest version without downloading the files I already have again?
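The thread does not confirm that the parquet files line up one-to-one across versions. Since each version is described as containing the previous one, one possible approach is to compare the metadata rows instead of the files, and keep only the rows of the newer version that are missing from the older one. The sketch below is only an illustration under assumptions: the helper name `rows_not_yet_downloaded` is hypothetical, the rows are toy data standing in for whatever is read from the parquet files, and it assumes a `URL` column is enough to identify a sample (a different key may be needed in practice).

```python
def rows_not_yet_downloaded(old_rows, new_rows, key="URL"):
    """Return the rows of new_rows whose key value does not occur in old_rows.

    Hypothetical helper: assumes every row is a dict and that `key`
    (here the assumed "URL" column) uniquely identifies a sample.
    """
    seen = {row[key] for row in old_rows}
    return [row for row in new_rows if row[key] not in seen]


# Toy metadata standing in for rows loaded from the parquet files.
v1 = [
    {"URL": "http://a/1.jpg", "TEXT": "cat"},
    {"URL": "http://a/2.jpg", "TEXT": "dog"},
]
v2 = [
    {"URL": "http://a/2.jpg", "TEXT": "dog"},
    {"URL": "http://a/3.jpg", "TEXT": "bird"},
    {"URL": "http://a/1.jpg", "TEXT": "cat"},
]

todo = rows_not_yet_downloaded(v1, v2)
print([row["URL"] for row in todo])  # ['http://a/3.jpg']
```

The filter does not depend on row order or on how rows are split across files, so it would still apply if the newer versions are laid out differently from version 1.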