Closed rom1504 closed 3 years ago
I think I will improve this a bit more; I don't like that it takes so much memory. I'm going to add a staging area on disk where I store the non-resized files, and periodically resize them all. That way it should use almost no memory, and I can use two different pools for resizing and downloading (downloading should have many threads, resizing should not).
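The two-pool idea above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `download_image` and `resize_image` are hypothetical placeholders, and the thread counts are just the ratio described (many I/O-bound download threads, few CPU-bound resize threads), with the staging area cleaned up between chunks.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

DOWNLOAD_THREADS = 192  # I/O bound: many threads help
RESIZE_THREADS = 8      # CPU bound: roughly one per core

def download_image(url, staging_dir):
    # Placeholder for the real HTTP fetch; writes raw bytes to the staging area.
    path = os.path.join(staging_dir, os.path.basename(url))
    with open(path, "wb") as f:
        f.write(b"raw image bytes")
    return path

def resize_image(path):
    # Placeholder for the real resize step; returns the path it processed.
    return path

def process_chunk(urls, staging_dir):
    # Stage 1: download with a large thread pool into the on-disk staging area.
    with ThreadPoolExecutor(DOWNLOAD_THREADS) as pool:
        staged = list(pool.map(lambda u: download_image(u, staging_dir), urls))
    # Stage 2: resize with a small thread pool.
    with ThreadPoolExecutor(RESIZE_THREADS) as pool:
        resized = list(pool.map(resize_image, staged))
    # Clean up the staging area between chunks so memory/disk stays bounded.
    for p in staged:
        os.remove(p)
    return resized

# Usage sketch with fake urls:
urls = [f"http://example.com/{i}.jpg" for i in range(10)]
staging = tempfile.mkdtemp()
results = process_chunk(urls, staging)
```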
Don't merge this for now. There are some memory leaks in places; I'm building a better solution that should work fast for any list of image urls.
OK, made it fairly fast now. It's going to take 10h in this state with no memory issue (because I clean up between chunks). It's also now outputting into subdirectories to avoid file system issues with millions of files in one folder.
That also prepares things to directly output tfrecord/tar chunks instead of image files.
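The subdirectory layout could look something like this sketch. The function name, shard size, and zero-padding widths are my own illustrative choices, not the PR's: the point is just that capping each directory at a fixed number of files keeps any single folder small.

```python
import os

def shard_path(output_dir, index, images_per_shard=10000):
    """Map a global image index to out_dir/<shard>/<index>.jpg,
    so no directory ever holds more than images_per_shard files."""
    shard = index // images_per_shard
    return os.path.join(output_dir, f"{shard:05d}", f"{index:09d}.jpg")
```

For example, image 12345 with 10000 images per shard lands in shard directory `00001`.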
Awesome work! Is it still WIP or can we merge it already?
I think it can be merged like this.
I will continue improving this, as I want such a downloader to work well for even larger lists of urls, but this specific script works for this dataset!
With 32GB of RAM I used 192 threads and managed to download at 60MB/s, which corresponds to 300 samples/s; that means 11h to download the dataset.
Before that change I was barely reaching 3MB/s.
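As a sanity check on those numbers (my arithmetic, not from the PR): 60MB/s at 300 samples/s implies an average of about 205KB per image, and 11 hours at that rate is roughly 12 million samples.

```python
throughput_mb_s = 60
samples_per_s = 300

# Average sample size implied by the two throughput figures.
avg_kb_per_sample = throughput_mb_s * 1024 / samples_per_s  # 204.8 KB

# Total samples downloadable in 11 hours at this rate.
total_samples = samples_per_s * 3600 * 11  # 11,880,000
```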
I recommend using https://github.com/uploadcare/pillow-simd instead of pillow for speed, but I'm not committing it to setup.py as it might not work for everyone; maybe for another PR.