Open yihong1120 opened 9 months ago
Hi Yihong,
Your suggestions make a lot of sense. I am interested in any change that would further improve the speed.
In particular
Regardless, I encourage you to try out any ideas.
I found that using a shard from a large dataset works well for getting reproducible speed results. I would usually use a shard from laion400m or laion2B-en, but since those are currently down, you may use coyo700m as a replacement. Running the tool for a few minutes and looking at the metrics on wandb is usually quite efficient.
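A benchmarking run along these lines might look like the following sketch. The flag names (`--enable_wandb`, `--processes_count`, `--thread_count`, etc.) are real img2dataset options, but the file name, column names, and parameter values are illustrative assumptions to adapt to your shard:

```shell
# Assumes one parquet shard (e.g. from coyo700m) has been saved locally as
# shard.parquet. Column names and tuning values below are just starting points.
img2dataset \
  --url_list shard.parquet \
  --input_format parquet \
  --url_col url \
  --caption_col text \
  --output_format webdataset \
  --output_folder bench_output \
  --image_size 256 \
  --processes_count 16 \
  --thread_count 64 \
  --enable_wandb True
# Let it run for a few minutes, then compare images/sec and success rate
# across runs on the wandb dashboard while varying processes/threads.
```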
Dear img2dataset Maintainers,
I hope this message finds you well. I am reaching out to discuss potential performance enhancements to img2dataset for large-scale image downloads. While the tool performs admirably, I believe there is room for further optimisation, particularly when operating on datasets exceeding 100 million images.
I have observed that the download speed tends to fluctuate and that, at times, CPU and bandwidth utilisation do not reach their full potential. This makes me wonder whether additional parallelisation strategies or more efficient resource allocation could be implemented.
Moreover, I have a few suggestions that might contribute to the tool's efficiency:
I am keen to hear your thoughts on these suggestions and would be delighted to contribute to the development of these enhancements.
Thank you for your time and the excellent work you have done with img2dataset. It is a vital tool for the machine learning community, and I am excited about its potential evolution.
Best regards, yihong1120