robvanvolt / DALLE-datasets

This is a summary of easily available datasets for generalized DALLE-pytorch training.
MIT License

download directly as wds #5

Open rom1504 opened 3 years ago

rom1504 commented 3 years ago

This would be much better than having 12M files. I will probably try this, because the number of files is a problem for me with cc12m (writing just the 12M captions takes 10 min), so scaling to a larger number of files simply won't work in the current state.

I might write a generic downloader in the process

rom1504 commented 3 years ago

https://github.com/robvanvolt/DALLE-datasets/blob/main/utilities/wds_create_shards.py should help
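
The idea of packing many small files into a few large archives can be sketched with the standard library alone. This is not the linked `wds_create_shards.py` script; it is a minimal illustration, assuming the common WebDataset convention that a shard is a plain tar file in which the files of one sample share a basename. The function name `write_shards`, the `.jpg`/`.txt` extensions and the `max_shard_bytes` parameter are hypothetical choices made for this example.

```python
import io
import tarfile
from pathlib import Path


def _add_member(tar, name, payload):
    """Add an in-memory byte string to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shards(samples, out_dir, max_shard_bytes=256 * 1024 * 1024):
    """Pack (key, image_bytes, caption) samples into numbered tar shards.

    Hypothetical helper: only payload bytes are counted against
    max_shard_bytes, so tar header overhead is ignored.
    Returns the list of shard paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    written = 0

    for key, image_bytes, caption in samples:
        caption_bytes = caption.encode("utf-8")
        sample_size = len(image_bytes) + len(caption_bytes)
        # Open a new shard if none is open or the current one would overflow.
        if tar is None or (written > 0 and written + sample_size > max_shard_bytes):
            if tar is not None:
                tar.close()
            path = out_dir / f"{len(shard_paths):05d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
            written = 0
        # Files of one sample share a basename and differ only by extension.
        _add_member(tar, f"{key}.jpg", image_bytes)
        _add_member(tar, f"{key}.txt", caption_bytes)
        written += sample_size

    if tar is not None:
        tar.close()
    return shard_paths
```

Because `samples` can be any iterable, a downloader could feed it a generator and write shards as images arrive, without ever creating one file per image on disk.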

I'm now even more convinced that this is needed after running the cc12m downloader. Linux file systems are bad at handling more than a million files (it can take minutes to delete the files or even list them). cc12m would be 12M files of about 20 KB each, so 240 GB in total; in 256 MB chunks that's only 938 files, which is much more manageable.
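
The shard estimate above can be checked with a few lines of arithmetic. This is a rough sketch, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and a uniform 20 KB per image; `shard_count` is a hypothetical helper name.

```python
import math


def shard_count(num_files, avg_file_bytes, shard_bytes):
    """Return (total_bytes, number_of_shards) for a dataset of equal-sized files."""
    total_bytes = num_files * avg_file_bytes
    return total_bytes, math.ceil(total_bytes / shard_bytes)


# 12M files of 20 KB each, packed into 256 MB shards (decimal units assumed).
total_bytes, shards = shard_count(12_000_000, 20_000, 256_000_000)
print(total_bytes / 1e9, "GB in", shards, "shards")
```

With these assumptions the total is 240 GB and the division gives 937.5, which rounds up to 938 shards. With binary units (20 KiB files, 256 MiB shards) the same helper gives 916, so the order of magnitude holds either way.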

robvanvolt commented 3 years ago

Agree 100%! The utilities were actually meant to be used one day for a direct downloader. I used them separately most of the time, so I forgot about the idea of merging these functions, especially regarding the "crazy" number-of-files-per-folder limit ;) :)

rom1504 commented 3 years ago

Hi @robvanvolt, I ended up building this tool for downloading Crawling at home: https://github.com/rom1504/img2dataset. It can download and resize 100M images in 20h, and it saves them directly as webdataset. It could download cc12m in 2.4h and cc3m in 40 minutes.
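
The download times quoted above follow from the stated throughput. A back-of-the-envelope sketch, assuming the nominal dataset sizes implied by the names (12M images for cc12m, 3M for cc3m) and a constant rate; `estimated_hours` is a hypothetical helper name.

```python
def estimated_hours(num_images, images_per_hour):
    """Estimate download time in hours at a constant throughput."""
    return num_images / images_per_hour


# Throughput taken from the comment: 100M images in 20 hours.
rate = 100_000_000 / 20  # images per hour

cc12m_hours = estimated_hours(12_000_000, rate)
cc3m_minutes = estimated_hours(3_000_000, rate) * 60
print(f"cc12m: {cc12m_hours:.1f} h, cc3m: {cc3m_minutes:.0f} min")
```

At 5M images per hour this gives 2.4 hours for cc12m and about 36 minutes for cc3m, in line with the roughly 40 minutes quoted.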

What would you think if I did a PR to replace the existing scripts so that they use that tool directly?

robvanvolt commented 3 years ago

Definitely, awesome work! Can you do a PR so I can merge it? 👍 Also, the downloading times are amazing!:))

rom1504 commented 3 years ago

Yeah, I will do it soon.

rom1504 commented 2 years ago

Finally got around to doing this at https://github.com/rom1504/img2dataset/tree/main/examples. I'm not sure how, or if, I should include that here, as it would mostly delete the existing scripts. What do you think, @robvanvolt?

robvanvolt commented 2 years ago

Really nice!

I will most likely port some features from this repo (not all of them uploaded to GitHub, like the SVG support) to img2dataset and use DALLE-datasets for wds examples, wds annotations, dataset sanity checks and other "useful" utilities. img2dataset is already such a powerful tool for downloading directly into wds that this seems to be the most appropriate way to go! :) But I will sleep on that matter for a few more days x)

rom1504 commented 2 years ago

sounds good!