img2dataset ignores X-Robots-Tag

rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

MIT License

img2dataset ignores X-Robots-Tag #298

Closed by Catbuttes 1 year ago

Catbuttes commented 1 year ago

img2dataset downloads every image that is passed to it, ignoring robots.txt, and only afterwards throws away any that opt out via the X-Robots-Tag header. This increases load on the remote server and raises hosting costs, because large numbers of images are downloaded even when the server owner has explicitly denied consent for this (in my case both in robots.txt and in the headers).

Please make it follow basic standards and respect robots.txt to avoid unfairly pushing unexpectedly high costs onto server owners. I'm sure many of us would be happy to provide the raw datasets if somebody were to ask - it would be cheaper than dealing with this.

rom1504 commented 1 year ago

X-Robots-Tag is supported: https://github.com/rom1504/img2dataset/blob/c9a1d4c519cc0f74012a247aaff8cface9adaf90/img2dataset/downloader.py#L22

For robots.txt see #48

Catbuttes commented 1 year ago

The headers are only checked after the query has been executed and the file downloaded. This still unfairly places load on servers whose admins have asked not to be included. Maybe you could put a circuit breaker in: if a domain has returned X-Robots-Tag opt-outs some number of times, treat all the remaining images from that domain the same way and skip them?
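The circuit-breaker idea suggested here could look roughly like the sketch below. It is only an illustration: `DomainCircuitBreaker`, its method names, and the default threshold of 3 are hypothetical and not part of img2dataset. It is also a heuristic, since an X-Robots-Tag header formally applies to the single URL it was served with.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainCircuitBreaker:
    """Hypothetical helper: stop requesting a domain after repeated opt-outs."""

    def __init__(self, threshold=3):
        # threshold is an assumed value; the thread does not name a number.
        self.threshold = threshold
        self.opt_outs = Counter()

    @staticmethod
    def _domain(url):
        # urlsplit().hostname is already lower-cased; fall back to "" if absent.
        return urlsplit(url).hostname or ""

    def record_opt_out(self, url):
        """Call this whenever a response for `url` carried an opt-out header."""
        self.opt_outs[self._domain(url)] += 1

    def should_skip(self, url):
        """True once the URL's domain has reached the opt-out threshold."""
        return self.opt_outs[self._domain(url)] >= self.threshold
```

A downloader would call `should_skip(url)` before each request and `record_opt_out(url)` after seeing the header, so a domain stops receiving requests once it reaches the threshold.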

rom1504 commented 1 year ago

This is a per-URL header, so we cannot infer that all other URLs of that domain are also banned.

The current implementation does not download the content of the request (the read() call) when this header is present: https://github.com/rom1504/img2dataset/blob/c9a1d4c519cc0f74012a247aaff8cface9adaf90/img2dataset/downloader.py#L46

Out of curiosity, can you share any details showing that this tool is causing significant traffic to your website? This should not happen, thanks to the random shuffling of URLs in most URL datasets.

dracos commented 1 year ago

Hi, Python does indeed request the entire URL and its contents before a read() call. As soon as you access r.headers, it will go and GET the entire URL in order to get those headers. The library does not make a HEAD request to get the headers alone. If you wish to make only a HEAD request, you have to do so explicitly (e.g. https://stackoverflow.com/questions/29327674/why-am-i-able-to-read-a-head-http-request-in-python-3-urllib-request, which involved a redirect but shows the initial request being a HEAD). So this tool does download all the images before checking whether they have the relevant header.
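For illustration, an explicit HEAD request with the standard library could be built as in the sketch below. The helper names and the `example-agent/0.1` User-Agent string are hypothetical; only `urllib.request.Request(..., method="HEAD")` and `urlopen` are standard-library calls.

```python
import urllib.request


def build_head_request(url, user_agent="example-agent/0.1"):
    """Build a request that asks for headers only (hypothetical helper)."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )


def fetch_headers(url, timeout=10):
    """Send the HEAD request and return the response headers.

    A server that handles HEAD correctly returns no body for this request.
    """
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
        return resp.headers
```

Calling `fetch_headers(url).get_all("X-Robots-Tag", [])` would then give the header values without requesting the image itself.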

rom1504 commented 1 year ago

Alright, that's a fair point. Let's try to implement HEAD then GET, or use a different implementation that allows pausing before getting the content.
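A rough sketch of the "HEAD then GET" approach follows. It is not the project's implementation: the function names, the `opener` parameter, and the directive set (`noai`, `noimageai`, `noindex`, `noimageindex`) are assumptions made for this example, and rules addressed to a specific bot are conservatively treated as applying to everyone.

```python
import urllib.request

# Assumed set of opt-out directives; adjust to whatever policy applies.
DISALLOWED = {"noai", "noimageai", "noindex", "noimageindex"}


def is_opted_out(header_values, disallowed=DISALLOWED):
    """True if any X-Robots-Tag value carries a disallowed directive."""
    for value in header_values:
        # A value may look like "noai, noindex" or "somebot: noai".
        # Anything before the first colon (a bot name) is ignored on purpose.
        directives = value.split(":", 1)[-1].split(",")
        if any(d.strip().lower() in disallowed for d in directives):
            return True
    return False


def download_if_allowed(url, opener=urllib.request.urlopen, timeout=10):
    """Send HEAD first; issue the GET only if no opt-out header is present."""
    head = urllib.request.Request(url, method="HEAD")
    with opener(head, timeout=timeout) as resp:
        if is_opted_out(resp.headers.get_all("X-Robots-Tag", [])):
            return None  # the image body is never requested
    with opener(urllib.request.Request(url), timeout=timeout) as resp:
        # Check again: some servers answer HEAD and GET with different headers.
        if is_opted_out(resp.headers.get_all("X-Robots-Tag", [])):
            return None
        return resp.read()
```

The trade-off is one extra round trip per URL that is allowed, in exchange for never requesting the body of a URL that opts out in its HEAD response.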

I am curious to see any traffic incurred on your website by this tool, as it's unlikely to happen given that most datasets are globally shuffled.

Could you be specific about that?

Catbuttes commented 1 year ago

I'm afraid I am not permitted to share logs outside of work, so I can't really provide more details than I already have. I know it isn't ideal, but at least you know about it now.

rom1504 commented 1 year ago

OK, tracking the specific feature of HEAD followed by GET in https://github.com/rom1504/img2dataset/issues/299

rom1504 commented 1 year ago

If anyone has concrete instances of traffic that can be attributed to this tool, feel free to share. Btw, if you can attribute traffic to it, nothing prevents you from banning the tool.