rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
MIT License
3.71k stars, 338 forks

Switch to requests to check headers before streaming content #307

Open raincoastchris opened 1 year ago

raincoastchris commented 1 year ago

Fixes #299

rom1504 commented 1 year ago

Interesting!

It would be ideal if you could run some benchmarks.

There are some options here: robots.txt and/or this header checking could improve both efficiency and compliance with host preferences.
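As a rough illustration of the robots.txt option, here is a minimal sketch using the standard library's `urllib.robotparser`. The user agent string `img2dataset`, the helper name `build_robots_parser`, and the sample rules are assumptions made for this example; they are not taken from the project's code or from this PR.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of an already-fetched robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = build_robots_parser(EXAMPLE_ROBOTS_TXT)

# "img2dataset" is an assumed user agent name, not a documented one.
blocked = parser.can_fetch("img2dataset", "https://example.com/private/cat.jpg")
allowed = parser.can_fetch("img2dataset", "https://example.com/public/cat.jpg")
```

In a real downloader the parsed rules would probably be cached per host, so that each robots.txt is fetched once rather than once per image URL.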

For some use cases we also want to discard a URL early if it turns out not to be an image, so being able to close the connection when the headers are not appropriate may be useful there too.
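The early-discard idea could look roughly like the sketch below. It assumes a `requests.Session` (or any object with a compatible `get` method) is passed in; the function names `is_image_response` and `download_if_image`, the timeout, and the chunk size are hypothetical and are not the code in this PR. With `stream=True`, requests returns once the headers have arrived and defers the body, so leaving the `with` block without reading closes the connection before the content is downloaded.

```python
def is_image_response(headers) -> bool:
    """Return True if the Content-Type header names an image media type."""
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def download_if_image(session, url, timeout=10, chunk_size=64 * 1024):
    """Fetch url, but only read the body when the headers describe an image.

    session is assumed to behave like requests.Session.
    Returns the raw bytes, or None if the response is not an image.
    """
    with session.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        if not is_image_response(response.headers):
            # Body was never read; exiting the block closes the connection.
            return None
        return b"".join(response.iter_content(chunk_size=chunk_size))
```

A call might look like `download_if_image(requests.Session(), "https://example.com/cat.jpg")`, where the URL is a placeholder. The same spot in the code is where other header-based rules could be checked before any content is streamed.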