Scoppio opened this issue 3 years ago
robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )
this tool is meant to be used after a crawler has been run, on the resulting validated urls
Hi @rom1504,
is the robots.txt protocol really meant only for "tools that discover urls"?
this tool is meant to be used after a crawler has been run, on the resulting validated urls
Does this mean it's a requirement that the crawler collecting the links keeps only links that are not disallowed in the robots.txt of the target site? I'm not aware of any web datasets that do this and erase such links from the HTML captures. Also, CCBot checks a site's robots.txt before accessing any HTML page on that site, but it does not remove links from WARC (and WAT) captures if the link would be disallowed by the target site's robots.txt.
In other words, there are several reasons why fetching a particular image might be disallowed by robots.txt, while fetching the HTML pages linking to the image was allowed:
- images may be disallowed by robots.txt while HTML pages are not, e.g., by rules such as the following (see the sketch after this list):
  Disallow: /media/
  Disallow: /images/
  Disallow: *.gif$
- the image link was found in an HTML page on another site (the robots.txt of the site where the image is hosted may disallow fetching the image)
- different user-agents are used when crawling the HTML and later when fetching the images
- time gap: the robots.txt may change between accessing the HTML page and the image
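A minimal sketch, using only Python's stdlib urllib.robotparser with made-up rules and URLs, of how rules like these allow the HTML page but disallow the image it embeds:

```python
# Minimal sketch (not img2dataset code): rules like the ones above can allow
# an HTML page while disallowing the image it embeds. The robots.txt content
# and URLs are made up for illustration.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /media/
Disallow: /images/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("mybot", "https://example.com/gallery/cats.html"))   # True
print(rp.can_fetch("mybot", "https://example.com/images/cat-001.jpg"))  # False

# Note: the stdlib parser does plain prefix matching, so wildcard rules such as
# "Disallow: *.gif$" would need a different parser.
```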
Happy to discuss these points – maybe it's worth reopening this issue or opening a new one. Thanks!
(see also #249)
check "robots.txt", as it is out of scope, and should probably be implemented in a way that allows caching to improve performance, and avoid multiple calls to "/robots.txt" per website.
Makes sense. This needs to be addressed in a previous filtering step before using img2dataset, then.
Feel free to implement such a tool.
Or can you think of a way this could be implemented in an efficient way for img2dataset's current architecture? I can't see how this could be done without doing 2 calls for each image. (Assuming it's even possible to find the robots.txt location.)
in a previous filtering step before using img2dataset
I don't think it's practical, given the maximum cache duration required by RFC 9309, section 2.4: "Crawlers SHOULD NOT use the cached version for more than 24 hours".
Another point would be the requirement for crawlers to "set their own name" (user-agent product token) and send it along with the HTTP request, see section 2.2.1.
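As an illustration of both requirements, here is a rough sketch (a hypothetical helper, not part of img2dataset; the user-agent string is an assumption) that fetches robots.txt with an explicit user-agent product token and keeps it for at most 24 hours:

```python
# Sketch of RFC 9309 sections 2.2.1 and 2.4: send an explicit user-agent product
# token and do not reuse a cached robots.txt for more than 24 hours.
# Helper names and the user-agent string are illustrative, not img2dataset's.
import time
import urllib.error
import urllib.request
from urllib.parse import urlsplit

USER_AGENT = "mybot/1.0"        # assumed product token for this sketch
MAX_AGE = 24 * 3600             # RFC 9309, section 2.4

_cache = {}                     # host -> (fetched_at, robots_txt_body)

def get_robots_txt(url):
    host = urlsplit(url).netloc
    now = time.time()
    cached = _cache.get(host)
    if cached and now - cached[0] < MAX_AGE:
        return cached[1]
    req = urllib.request.Request(
        f"https://{host}/robots.txt",       # assumes https; see the RFC for redirects
        headers={"User-Agent": USER_AGENT},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        body = ""                           # simplification; see the status-code sketch below
    _cache[host] = (now, body)
    return body
```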
Or can you think of a way this could be implemented in an efficient way for img2dataset's current architecture?
I see two ways to go: (1) keep a shared per-host cache of robots.txt in a key-value store, or (2) partition the URL list by hash(hostname) % num_partitions so that all URLs of a host land in the same worker, which can then cache that host's robots.txt locally.
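One possible reading of the hash(hostname) % num_partitions idea, as a sketch with illustrative names only: send every URL of a host to the same partition, so a single worker owns that host's robots.txt (and could also rate-limit it):

```python
# Sketch of partitioning by hash(hostname) % num_partitions: all URLs of a given
# host end up in the same partition, so one worker can cache that host's
# robots.txt (and apply per-host rate limits). Names are illustrative only.
import hashlib
from urllib.parse import urlsplit

def partition_for(url, num_partitions):
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()  # stable across processes
    return int(digest, 16) % num_partitions

def partition_urls(urls, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for url in urls:
        partitions[partition_for(url, num_partitions)].append(url)
    return partitions
```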
(Assuming it's even possible to find the robots.txt location.)
The robots.txt location is defined in the RFC (it is /robots.txt), including the expected behavior if the location is redirected or in case of a response code other than HTTP 200.
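For reference, a simplified sketch of the status-code handling described in RFC 9309, section 2.3.1 (not a complete implementation; redirect handling is assumed to happen before this step):

```python
# Simplified reading of RFC 9309, section 2.3.1, for the /robots.txt request:
#   2xx -> parse the returned rules
#   4xx -> "unavailable": the crawler may access any resources (no restrictions)
#   5xx -> "unreachable": the crawler must assume complete disallow
# Redirects (3xx) should be followed first, up to at least five hops.
def rules_from_response(status_code, body):
    if 200 <= status_code < 300:
        return body
    if 400 <= status_code < 500:
        return ""                               # treat as allow-all
    if status_code >= 500:
        return "User-agent: *\nDisallow: /"     # treat as disallow-all
    raise ValueError("resolve redirects before calling this helper")
```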
2 is not possible for two reasons:
1 may be possible, but we'd need to find good ways to automatically deploy a KV solution to keep the tool easy to use. There are other reasons that would justify using such a KV store per domain (e.g., speed limiting per domain), so that might be something interesting to investigate.
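A minimal sketch of what such a per-domain KV cache could look like, assuming Redis via redis-py purely as an example backend (not something img2dataset ships); the same store could also hold per-domain rate-limit state:

```python
# Sketch of a shared per-domain robots.txt cache in a KV store. Redis is only one
# possible backend, chosen here for illustration; key names and helpers are made up.
import redis  # pip install redis

kv = redis.Redis(host="localhost", port=6379)

def get_cached_robots(host, fetch_fn, max_age=24 * 3600):
    """Return the robots.txt body for `host`, fetching it at most once per 24h."""
    key = f"robots:{host}"
    cached = kv.get(key)
    if cached is not None:
        return cached.decode("utf-8")
    body = fetch_fn(host)            # e.g. an HTTP fetch like the sketch above
    kv.set(key, body, ex=max_age)    # the store enforces the 24h expiry
    return body
```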
Parsing robots.txt is a solved problem in Python; it's in the stdlib: https://docs.python.org/3/library/urllib.robotparser.html
Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?
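One way to read "once per worker", as a sketch with illustrative names (not img2dataset code): each worker lazily fetches and parses robots.txt with the stdlib parser the first time it sees a host, then reuses it for every later URL from that host:

```python
# Sketch of a per-worker robots.txt cache built on the stdlib parser.
# Class name and user-agent string are illustrative, not img2dataset's.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class WorkerRobotsCache:
    def __init__(self, user_agent="mybot/1.0"):   # assumed product token
        self.user_agent = user_agent
        self.parsers = {}                         # host -> RobotFileParser

    def allowed(self, url):
        host = urlsplit(url).netloc
        parser = self.parsers.get(host)
        if parser is None:
            parser = RobotFileParser()
            parser.set_url(f"https://{host}/robots.txt")
            try:
                parser.read()                     # note: no custom User-Agent header here
            except OSError:
                parser.allow_all = True           # simplification: network error -> allow
            self.parsers[host] = parser
        return parser.can_fetch(self.user_agent, url)
```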
Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?
Feel free to give it a try
robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )
this tool is meant to be used after a crawler has been run, on the resulting validated urls
No. A lot of sites were indexed by Common Crawl before its index was used to train AIs. They have since opted out, but their pages live on in old copies of the index.
A website I maintain is regularly hit by img2dataset bots, even though the site now disallows and even blocks Common Crawl. The site sends the "noai" X-Robots-Tag but this is a waste of CPU and bandwidth. It makes more sense to add something to robots.txt so that these crawlers just stay away.
Scripts and software for automated scraping must follow robots.txt rules; otherwise they may make the user liable for unauthorised use of data.