rom1504 / laion-prepro

Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them
How many samples are in the dataset? #13

Closed. qiaogh97 closed this issue 3 years ago

qiaogh97 commented 3 years ago

Hi @rom1504, I downloaded the 32 parquet files and counted the total number of URLs. I found about 26760000 URLs in each parquet file, and 32 * 26760000 is about 856 million, roughly 800 million. But you said this dataset has 400M samples. What explains the difference?
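For reference, the count described above could be reproduced with a sketch like the one below. It assumes `pyarrow` is installed and that the 32 files sit in one local directory; the directory name `laion-meta` and the helper names `count_rows` and `total_rows` are hypothetical, not part of this repository. Row counts are read from each file's parquet footer, so no URL column has to be loaded into memory.

```python
from pathlib import Path


def count_rows(parquet_path):
    """Return the row count stored in one parquet file's footer metadata."""
    # Imported lazily so the summing helper below can be used without pyarrow.
    import pyarrow.parquet as pq

    return pq.read_metadata(str(parquet_path)).num_rows


def total_rows(directory, pattern="*.parquet", counter=count_rows):
    """Sum the row counts of every file in `directory` matching `pattern`.

    `counter` is a hypothetical hook: any callable mapping a path to a count.
    """
    return sum(counter(path) for path in sorted(Path(directory).glob(pattern)))
```

Calling `total_rows("laion-meta")` on the downloaded directory would give the overall number of rows. With 32 files of about 26760000 rows each, the result is about 856 million, which is far above 400M and suggests the files are not the laion400m release.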

rom1504 commented 3 years ago

Hi, where did you download the parquet files from? http://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ has laion400m.

If you downloaded from 3080.rom1504.fr, you probably got a more recent version of the dataset that is indeed much bigger (and not really released yet).

rom1504 commented 3 years ago

Ah yes, I see I left that 3080 link in the README. I need to fix it :)

qiaogh97 commented 3 years ago

Ok, I see