rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.62k stars 336 forks

Download all of CC3M #369

Closed (MohammedSB closed this issue 9 months ago)

MohammedSB commented 9 months ago

Hello,

This is a slightly unrelated question, but could someone please point me to where I can download the entire CC3M dataset? Is it hosted publicly somewhere on AWS or another cloud?

Many of the URLs in the Google-provided database no longer work, so I was only able to download fewer than 3M of the original 3.3M images.

Would appreciate your help!

rom1504 commented 9 months ago

Downloading from the links is the definition of CC3M.

MohammedSB commented 9 months ago

Yeah, I get that CC3M is meant to be a non-curated, web-scraped dataset. Sadly, downloading the original data from the links no longer works, since many of the images are no longer available.

I am especially interested in getting all (or most, within ±100k) of the data because I want to compare training methods against results from the literature.

rom1504 commented 9 months ago

You can try https://huggingface.co/datasets/pixparse/cc3m-wds
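
For readers who want to check whether a copy of the dataset is complete enough for the comparison described earlier in the thread, here is a minimal sketch. It assumes the Hugging Face `datasets` package is installed and that the repository exposes a split named `train`, neither of which is confirmed in this thread. The helper names (`count_samples`, `within_tolerance`, `stream_cc3m`) are hypothetical, and the reference size is only the approximate 3.3M figure quoted above.

```python
from itertools import islice
from typing import Iterable, Optional

REFERENCE_SIZE = 3_300_000  # approximate original size quoted in this thread
TOLERANCE = 100_000         # the 100k margin mentioned in this thread


def count_samples(samples: Iterable, limit: Optional[int] = None) -> int:
    """Count items in an iterable, stopping after `limit` items if given."""
    return sum(1 for _ in islice(samples, limit))


def within_tolerance(count: int,
                     reference: int = REFERENCE_SIZE,
                     tolerance: int = TOLERANCE) -> bool:
    """Return True if `count` is within `tolerance` of `reference`."""
    return abs(reference - count) <= tolerance


def stream_cc3m(split: str = "train"):
    """Open the hosted copy as a streaming dataset (nothing is read yet).

    Assumption: the repository has a split with this name.
    """
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("pixparse/cc3m-wds", split=split, streaming=True)
```

As a quick smoke test, `count_samples(stream_cc3m(), limit=1000)` reads only the first 1000 samples. Passing `limit=None` counts everything, which means streaming every shard and may take a long time; the result can then be passed to `within_tolerance`.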