robvanvolt / DALLE-datasets

This is a summary of easily available datasets for generalized DALLE-pytorch training.
MIT License

add a table with dataset sizes #1

Open rom1504 opened 3 years ago

rom1504 commented 3 years ago

Having a table of the datasets with some information about size and time to download would be useful. https://docs.google.com/document/d/1KCAB-OTHphcCh-4oITIL8r7ih-HuslMKX1Rls_P03CY/edit could serve as complementary information.

rom1504 commented 3 years ago

I will add information here as I download things. Starting with CC3M: I intend to download it, then produce some CLIP embeddings (using https://github.com/rom1504/clip-retrieval/) and a list of CLIP-filtered files.

Once it's clear enough, I will open a PR to the README.

rom1504 commented 3 years ago

I downloaded CC3M and CC12M (improving their script a bit in the process).

CC3M can obviously take much less time when using the improved CC12M script. In the process I confirmed that handling millions of files is painful, so I will make it possible to download directly as a collection of tars (i.e. the webdataset format).
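
To illustrate the "collection of tars" idea, here is a minimal sketch using only the Python standard library. It is not the script referred to above; the function name `write_shards`, the `shard_size` parameter, and the `.jpg`/`.txt` file naming are assumptions made for the example. It relies on the webdataset convention that files sharing a basename inside a tar belong to the same sample.

```python
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=1000):
    """Pack (key, image_bytes, caption) samples into numbered tar shards.

    Hypothetical helper for illustration. Files that share a basename
    (e.g. 000000001.jpg and 000000001.txt) form one sample.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (key, image_bytes, caption) in enumerate(samples):
        # Start a new shard every `shard_size` samples.
        if index % shard_size == 0:
            if tar is not None:
                tar.close()
            path = out_dir / f"{index // shard_size:05d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        # Write the image and its caption under the same key.
        for ext, payload in ((".jpg", image_bytes), (".txt", caption.encode("utf-8"))):
            info = tarfile.TarInfo(name=key + ext)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths
```

With this layout a dataset of millions of images becomes a few thousand tar files, which avoids creating one filesystem entry per image and caption.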

kartikpodugu commented 1 year ago

@rom1504 the doc is no longer available. I want to download the data, can you please help me? I could only find the download_open_images.txt file in the repo. How do I download using a text file?