rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
MIT License
3.62k stars, 336 forks

Retry only on certain HTTP codes #368

Open pabl0 opened 9 months ago

pabl0 commented 9 months ago

This is an attempt to fix #332 in a simple manner (not using anything fancy like urllib3.Retry). I think it should improve d/l performance significantly on datasets with large amounts of 404 images, but I have not done a lot of benchmarking.

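I haven't found any best practices (like RFCs) wrt what HTTP codes to retry, but the following should be a reasonable list:

  • 408 Request Timeout
  • 429 Too Many Requests (respect the Retry-After header if it's in seconds and less than 10)
  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout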

rom1504 commented 9 months ago

Try it out, if benchmark results look good it could be a good option.

On Thu, Dec 14, 2023, 21:45 Henrik Ahlgren @.***> wrote:

This is an attempt to fix #332 https://github.com/rom1504/img2dataset/issues/332 in a simple manner (not using anything fancy like urllib3.Retry). I think it should improve d/l performance significantly on datasets with large amounts of 404 images, but I have not done a lot of benchmarking.

I haven't found any best practices (like RFCs) wrt what HTTP codes to retry, but the following should be a reasonable list:

  • 408 Request Timeout
  • 429 Too Many Requests (respect the Retry-After header if it's in seconds and less than 10)
  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout
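
For illustration, the policy in the list above could be sketched roughly as follows. This is not the code from the pull request: `retry_delay`, `fetch_with_retries`, and the injected `fetch` callable are hypothetical names chosen for this example. The description does not say what happens when Retry-After is missing, is an HTTP date, or is 10 seconds or more, so the sketch assumes the header is simply ignored and the request is retried without waiting.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Status codes from the list above that are treated as transient.
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})
# Retry-After values of this many seconds or more are not honoured.
MAX_RETRY_AFTER = 10

# (status code, response headers, body); a hypothetical shape for this sketch.
Response = Tuple[int, Mapping[str, str], bytes]


def retry_delay(status: int, headers: Mapping[str, str]) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no retry is wanted."""
    if status not in RETRYABLE_STATUS:
        return None  # e.g. 404: give up immediately
    if status == 429:
        # Header names are case-insensitive, so compare in lowercase.
        value = next(
            (v for k, v in headers.items() if k.lower() == "retry-after"), ""
        ).strip()
        # Honour only the delay-in-seconds form, and only short delays.
        if value.isascii() and value.isdigit() and int(value) < MAX_RETRY_AFTER:
            return float(value)
    # Assumption: in every other retryable case, retry without waiting.
    return 0.0


def fetch_with_retries(
    fetch: Callable[[str], Response],
    url: str,
    retries: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Optional[bytes], int]:
    """Call fetch(url) until success, a non-retryable status, or retries run out.

    Returns (last status, body or None, number of attempts made).
    """
    attempts = 0
    while True:
        status, headers, body = fetch(url)
        attempts += 1
        if status == 200:
            return status, body, attempts
        delay = retry_delay(status, headers)
        if delay is None or attempts > retries:
            return status, None, attempts
        sleep(delay)
```

With this shape, a URL that answers 404 costs exactly one request, while a 503 is retried up to `retries` more times. Passing `fetch` and `sleep` as arguments keeps the sketch independent of any particular HTTP client and makes it easy to exercise without network access.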
