rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.64k stars 336 forks

Query Regarding Performance Optimisation for Large-Scale Downloads #374

Open yihong1120 opened 9 months ago

yihong1120 commented 9 months ago

Dear img2dataset Maintainers,

I hope this message finds you well. I am reaching out to discuss a potential enhancement in the performance of img2dataset when dealing with large-scale image downloads. While the tool performs admirably, I believe there is room for further optimisation, particularly when operating on datasets exceeding 100 million images.

I have observed that the download speed tends to fluctuate, and at times, the CPU and bandwidth utilisation do not reach their full potential. This observation leads me to ponder whether additional parallelisation strategies or more efficient resource allocation could be implemented.

Moreover, I have a few suggestions that might contribute to the tool's efficiency:

  1. Introducing a dynamic thread management system that can adapt to the current network and CPU load.
  2. Implementing a more sophisticated DNS caching mechanism to reduce the overhead of DNS lookups.
  3. Exploring the possibility of integrating with a CDN or other network optimisation services to enhance download speeds globally.
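To make suggestion 2 concrete, here is a minimal sketch of a TTL-based DNS cache. It is not img2dataset's actual resolution path; the class name, the `resolver` callable, and the default TTL are all illustrative assumptions. The idea is to memoize a resolver function (e.g. `socket.gethostbyname`) so repeated downloads from the same domain skip redundant lookups, while entries expire so changing domain-to-IP mappings are still picked up.

```python
import threading
import time


class TTLDNSCache:
    """Hypothetical TTL-based DNS cache (illustrative, not img2dataset code).

    Wraps a resolver callable and caches each hostname's result for
    `ttl` seconds, so hot domains avoid repeated lookups while stale
    entries are eventually re-resolved.
    """

    def __init__(self, resolver, ttl=300.0):
        self._resolver = resolver
        self._ttl = ttl
        self._cache = {}  # hostname -> (ip, expiry timestamp)
        self._lock = threading.Lock()

    def resolve(self, hostname):
        now = time.monotonic()
        with self._lock:
            entry = self._cache.get(hostname)
            if entry is not None and entry[1] > now:
                return entry[0]  # fresh cache hit: no network lookup
        ip = self._resolver(hostname)  # do the lookup outside the lock
        with self._lock:
            self._cache[hostname] = (ip, now + self._ttl)
        return ip
```

In a real integration the cache would sit in front of whatever resolver the download workers use; handling DNS load balancing (multiple IPs per domain) would need the resolver to return and rotate the full address list rather than a single IP.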

I am keen to hear your thoughts on these suggestions and would be delighted to contribute to the development of these enhancements.

Thank you for your time and the excellent work you have done with img2dataset. It is a vital tool for the machine learning community, and I am excited about its potential evolution.

Best regards, yihong1120

rom1504 commented 9 months ago

Hi Yihong,

Your suggestions make a lot of sense. I am interested in any improvement that would increase the speed further.

In particular

  1. Dynamic thread management would be interesting. I tried in the past to implement timeouts without great success; maybe starting more threads when some are stuck would help. One limiting factor, however, is the operating system's capacity to open enough TCP connections.
  2. Yes, that's in fact something some users are currently looking into, as DNS lookup is difficult in some environments. I recommend knot resolver in the readme, but I would appreciate any built-in solution for this. I tried static resolving in the past but hit issues (DNS load balancing needs to be handled, and some domain-to-IP mappings change often).
  3. I'm curious about your ideas with CDN. Do you mean externally hosted software or some CDN implementation?
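One way to sketch point 1 (adapting concurrency to load while respecting the TCP-connection cap) is a controller that raises its concurrency limit when downloads complete quickly and lowers it when they stall. Everything here is a hypothetical illustration, not img2dataset's implementation; the class, thresholds, and limits are assumptions.

```python
import threading


class AdaptiveConcurrency:
    """Illustrative sketch: a concurrency limit that grows when
    downloads finish fast and shrinks when they stall, capped by
    `max_limit` to stay within the OS limit on open TCP connections.
    """

    def __init__(self, start=16, min_limit=4, max_limit=256, slow_s=5.0):
        self.limit = start
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.slow_s = slow_s  # downloads slower than this count as stalled
        self._in_flight = 0
        self._cond = threading.Condition()

    def acquire(self):
        """Block until a download slot is available."""
        with self._cond:
            while self._in_flight >= self.limit:
                self._cond.wait()
            self._in_flight += 1

    def release(self, elapsed_s):
        """Free a slot and adjust the limit from the observed latency."""
        with self._cond:
            self._in_flight -= 1
            if elapsed_s > self.slow_s:
                # Stalled download: back off to ease network/CPU pressure.
                self.limit = max(self.min_limit, self.limit - 1)
            else:
                # Healthy download: probe upward for more parallelism.
                self.limit = min(self.max_limit, self.limit + 1)
            self._cond.notify_all()
```

Each worker would call `acquire()` before a download and `release(elapsed)` after it; a real version would likely smooth over many samples rather than react to single downloads.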

Regardless, I encourage you to try out any ideas.

To get reproducible speed results, I found that using a shard from a large dataset works well. I would usually use a shard from laion400m or laion2B-en, but since those are currently down, you may use coyo700m as a replacement. Usually, running the tool for a few minutes and looking at the metrics on wandb is pretty efficient.
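The benchmarking workflow above could look something like the following, using the `download` entry point from img2dataset's readme. Treat it as a configuration sketch: the shard filename and column names are assumptions, and actually running it requires the package, network access, and a real shard, so parameter values should be adjusted to the environment.

```python
# Hypothetical speed benchmark on a single shard (illustrative values).
from img2dataset import download

download(
    url_list="coyo700m_shard_00000.parquet",  # one shard; filename is made up
    input_format="parquet",
    url_col="url",               # assumed column name in the shard
    output_folder="bench_output",
    output_format="webdataset",
    processes_count=16,
    thread_count=64,
    image_size=256,
    enable_wandb=True,           # watch img/s and success rate live on wandb
)
```

Running this for a few minutes and comparing the wandb throughput curves before and after a change gives a quick, reproducible signal without downloading a full dataset.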