make cc12m downloader fast

robvanvolt / DALLE-datasets

This is a summary of easily available datasets for generalized DALLE-pytorch training.

MIT License

127 stars 16 forks

make cc12m downloader fast #4

Closed rom1504 closed 3 years ago

rom1504 commented 3 years ago

Use a multiprocessing pool instead of pandarallel: better memory handling allows a much higher thread count.
Add a thread parameter.
Add a root_folder param (unrelated, but a good change).
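To illustrate the changes listed above, here is a minimal sketch of a pool-based downloader with a configurable thread count and output folder. It is not the code from this PR: `download_all`, `fetch`, `thread_count`, and `fetch_fn` are hypothetical names, and using `ThreadPool` (threads rather than processes) is an assumption based on the word "threads" in the description.

```python
import os
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the raw bytes at url, or None on any failure (hypothetical helper)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except Exception:
        return None


def download_all(urls, root_folder, thread_count=16, fetch_fn=fetch):
    """Download every URL into root_folder and return the number of files written."""
    os.makedirs(root_folder, exist_ok=True)

    def job(item):
        index, url = item
        data = fetch_fn(url)
        if data is None:
            return False
        # File name is derived from the position in the URL list (an assumption).
        with open(os.path.join(root_folder, f"{index:08d}.jpg"), "wb") as f:
            f.write(data)
        return True

    # Downloads are I/O bound, so the pool can be much larger than the CPU count.
    with ThreadPool(thread_count) as pool:
        results = list(pool.imap_unordered(job, enumerate(urls)))
    return sum(results)
```

Passing `fetch_fn` as a parameter keeps the sketch testable without network access. Each worker only holds one image in memory at a time, which is what makes a high thread count affordable.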

With 32 GB of RAM I used 192 threads and managed to download at 60 MB/s, which corresponds to about 300 samples/s. That means roughly 11 h to download the dataset.

Before that change I was barely reaching 3 MB/s.
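The arithmetic behind these figures can be checked with a short calculation. The total of about 12 million samples is an assumption taken from the dataset's name, not a number stated in this thread, and `estimate` is a hypothetical helper.

```python
def estimate(throughput_mb_s, samples_per_s, total_samples):
    """Return (average sample size in KB, total download time in hours)."""
    avg_kb = throughput_mb_s * 1000 / samples_per_s
    hours = total_samples / samples_per_s / 3600
    return avg_kb, hours


# Assumption: roughly 12 million image URLs in the dataset.
avg_kb, hours = estimate(throughput_mb_s=60, samples_per_s=300, total_samples=12_000_000)
print(f"average sample size: {avg_kb:.0f} KB, estimated time: {hours:.1f} h")
```

Under the same assumptions, 3 MB/s at the same average sample size would be about 15 samples/s, which is on the order of 200 hours for the full dataset.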

I recommend using https://github.com/uploadcare/pillow-simd instead of Pillow for speed, but I'm not adding it to setup.py as it might not work on all downloaders. Maybe in another PR.

rom1504 commented 3 years ago

I think I will improve this a bit more; I don't like the fact that it's taking so much memory. I think I'm going to use a staging area on disk where I store the non-resized files, and periodically resize them all in one pass. That way it should take almost no memory, and I can use two different pools for resizing and downloading (downloading should have many threads, resizing only a few).
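The staging idea described above could look roughly like the following sketch. It is an illustration under assumptions, not the implementation from this PR: `process_chunk`, `resize_with_pillow`, and all parameter names are hypothetical, the download function is passed in by the caller, and both stages use thread pools for simplicity (a process pool may suit CPU-bound resizing better).

```python
import os
from multiprocessing.pool import ThreadPool


def resize_with_pillow(src_path, dst_path, size=256):
    """Resize one staged file with Pillow (or pillow-simd); size is an assumed default."""
    from PIL import Image  # imported lazily so the rest of the sketch runs without Pillow

    with Image.open(src_path) as img:
        img = img.convert("RGB")
        img.thumbnail((size, size))  # in-place, keeps aspect ratio
        img.save(dst_path, "JPEG")


def process_chunk(items, staging_dir, output_dir, fetch_fn,
                  resize_fn=resize_with_pillow,
                  download_threads=64, resize_workers=4):
    """items is a list of (key, url). Returns the number of resized images written."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)

    def download(item):
        key, url = item
        data = fetch_fn(url)
        if data is None:
            return None
        # Raw bytes go straight to disk, so memory use stays flat.
        with open(os.path.join(staging_dir, key), "wb") as f:
            f.write(data)
        return key

    def resize(key):
        src = os.path.join(staging_dir, key)
        try:
            resize_fn(src, os.path.join(output_dir, key + ".jpg"))
            return True
        except Exception:
            return False  # unreadable or corrupt image
        finally:
            os.remove(src)  # empty the staging area as we go

    # Stage 1: many threads, since downloading is I/O bound.
    with ThreadPool(download_threads) as pool:
        staged = [k for k in pool.imap_unordered(download, items) if k is not None]
    # Stage 2: few workers, since resizing is CPU bound.
    with ThreadPool(resize_workers) as pool:
        done = sum(pool.imap_unordered(resize, staged))
    return done
```

Separating the two stages lets each pool be sized for its own bottleneck, and the staging directory is empty again once a chunk has been processed.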

rom1504 commented 3 years ago

Don't merge this for now. There are some memory leaks in places. I'm building a better solution that should be fast and work for any list of image URLs.

rom1504 commented 3 years ago

OK, made it fairly fast now. It's going to take 10 h in that state, with no memory issues (because I clean up between chunks). It's also now outputting to subdirs to avoid file system issues with millions of files in one folder.

That also prepares things for directly outputting tfrecord/tar chunks instead of image files.
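A rough sketch of the chunked layout described above: the URL list is split into fixed-size chunks, each chunk gets its own subdirectory, and the same chunk boundary could later be used to write one tar file per chunk. The helper names, the zero-padded naming scheme, and the tar layout are assumptions for illustration, not the format used in this PR.

```python
import io
import os
import tarfile


def chunked(items, chunk_size):
    """Yield (chunk_id, sublist) pairs of at most chunk_size items each."""
    for start in range(0, len(items), chunk_size):
        yield start // chunk_size, items[start:start + chunk_size]


def sample_path(root_folder, chunk_id, index):
    """Path of one image inside its chunk subdirectory (hypothetical naming scheme)."""
    return os.path.join(root_folder, f"{chunk_id:05d}", f"{index:08d}.jpg")


def write_chunk_as_tar(root_folder, chunk_id, samples):
    """Write a list of (name, bytes) pairs into a single tar file for this chunk."""
    os.makedirs(root_folder, exist_ok=True)
    tar_path = os.path.join(root_folder, f"{chunk_id:05d}.tar")
    with tarfile.open(tar_path, "w") as tar:
        for name, data in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return tar_path
```

With, say, 10,000 samples per chunk, a dataset of around 12 million images would end up in about 1,200 subdirectories or tar files instead of one huge folder.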

robvanvolt commented 3 years ago

Awesome work! Is it still WIP or can we merge it already?

rom1504 commented 3 years ago

I think it can be merged like this.

I will continue improving this, as I want such a downloader to work well for even larger lists of URLs, but this specific script works for this dataset!