Closed: Catbuttes closed this issue 1 year ago
X-Robots-Tag is supported https://github.com/rom1504/img2dataset/blob/c9a1d4c519cc0f74012a247aaff8cface9adaf90/img2dataset/downloader.py#L22
For robots.txt see #48
The headers are only checked after the request has been executed and the file downloaded. This still unfairly places load on servers whose admins have requested not to be included. Maybe you could put a circuit breaker in: if a domain has returned some number of X-Robots-Tag refusals, treat all remaining images from that domain the same way and skip them?
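A minimal sketch of what such a per-domain circuit breaker could look like. The class name, threshold, and API are all illustrative, not part of img2dataset:

```python
from collections import defaultdict

class DomainCircuitBreaker:
    """Sketch: after `threshold` X-Robots-Tag refusals from a domain,
    stop issuing requests to that domain entirely. Illustrative only."""

    def __init__(self, threshold=10):  # assumption: threshold is tunable
        self.threshold = threshold
        self.refusals = defaultdict(int)

    def record_refusal(self, domain):
        # Called whenever a response from `domain` carries a blocking tag
        self.refusals[domain] += 1

    def is_open(self, domain):
        # "Open" breaker means: skip all further URLs for this domain
        return self.refusals[domain] >= self.threshold
```

In the download loop, URLs whose domain has an open breaker would be dropped before any request is made, which is what avoids the extra load.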
This is a header per URL; we cannot infer that all other URLs of that domain are also banned.
https://github.com/rom1504/img2dataset/blob/c9a1d4c519cc0f74012a247aaff8cface9adaf90/img2dataset/downloader.py#L46 the current implementation does not read the body of the response (skips the `read()` call) when this header is present.
Out of curiosity, can you share any details showing that this tool is causing significant traffic to your website? That should not happen, thanks to the random shuffling of URLs in most URL datasets.
Hi, Python does indeed request the entire URL and its contents before a `read()` call. As soon as you access `r.headers`, it will `GET` the entire URL in order to obtain those headers. The library does not make a `HEAD` request to get the headers alone. If you wish to make only a HEAD request, you have to do so explicitly (e.g. https://stackoverflow.com/questions/29327674/why-am-i-able-to-read-a-head-http-request-in-python-3-urllib-request, which involved a redirect but shows the initial request being a HEAD). So this tool does download all the images before checking whether they have the relevant header.
Alright, that's a fair point. Let's try to implement HEAD then GET, or use a different implementation that allows pausing before fetching the content.
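A minimal sketch of what HEAD-then-GET could look like with the standard library. The function name, timeout, and the set of blocking directives checked are assumptions for illustration, not the actual img2dataset implementation:

```python
import urllib.request

def fetch_if_allowed(url, timeout=10):
    """Sketch: issue an explicit HEAD first, and only GET the body
    when no X-Robots-Tag directive forbids it. Illustrative only."""
    head_req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head_req, timeout=timeout) as head_resp:
        tag = head_resp.headers.get("X-Robots-Tag", "")
    if "noindex" in tag or "none" in tag:
        return None  # server opted out; skip the GET entirely
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Note the trade-off: every allowed image now costs two round trips instead of one, but opted-out servers never have to serve the image bytes at all.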
I am curious to see any traffic incurred on your website by this tool, as that is unlikely given that most datasets are globally shuffled.
Could you be specific about that?
I'm afraid I am not permitted to share logs outside of work, so I can't provide more details than I already have. I know it isn't ideal, but at least you know about it now.
OK, tracking the specific HEAD-followed-by-GET feature in https://github.com/rom1504/img2dataset/issues/299
If anyone has concrete instances of traffic that can be attributed to this tool, feel free to share them. By the way, if you can attribute the traffic, then nothing prevents you from banning the tool.
img2dataset downloads all images that are passed to it, ignoring robots.txt, and only afterwards throws away any that bar it via the X-Robots-Tag header. This increases load on the remote server and raises hosting costs by downloading large numbers of images even when the server owner has explicitly denied consent (in my case, both in robots.txt and in the headers).
Please make it follow basic standards and respect robots.txt, to avoid unfairly pushing unexpectedly high costs onto server owners. I'm sure many of us would be happy to provide the raw datasets if somebody asked; it would be cheaper than dealing with this.
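For reference, Python's standard library already ships a robots.txt parser, so a check before each fetch does not need any new dependency. A minimal sketch, assuming a user-agent string of "img2dataset" (the actual string the tool would advertise is a choice for the maintainers) and with no per-host caching:

```python
import urllib.robotparser
from urllib.parse import urlsplit

def allowed_by_robots(url, user_agent="img2dataset"):
    """Sketch: consult the host's robots.txt before fetching `url`.
    In real use, cache one parser per host instead of re-reading
    robots.txt for every URL. Illustrative only."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # assumption: treat an unreachable robots.txt as allow
    return rp.can_fetch(user_agent, url)
```

Combined with the HEAD check discussed above, a disallowed URL would then cost the server at most one small robots.txt fetch rather than a full image download.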