rom1504 / img2dataset

Easily turn large sets of image urls into an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Implement Robots.txt support #48

Open Scoppio opened 3 years ago

Scoppio commented 3 years ago

Scripts and software for automated scraping must follow robots.txt rules; otherwise they may make the user liable for unauthorised use of data.

rom1504 commented 3 years ago

robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )

this tool is meant to be used after a crawler has been run, on the resulting validated urls

sebastian-nagel commented 1 year ago

Hi @rom1504,

is the robots.txt protocol really meant only for "tools that discover urls"?

  1. RFC 9309 is addressed to "automatic clients known as crawlers". I think we can agree that img2dataset is an "automatic client" (or uses one).
  2. robots.txt files in the wild often provide rulesets addressing image "crawlers", e.g. "Googlebot-Image", "Baiduspider-image", "YandexImages".

this tool is meant to be used after a crawler has been run, on the resulting validated urls

Does this mean that it's a requirement that the crawler collecting the links only keeps links that are not disallowed in the robots.txt of the target site? I'm not aware of any web datasets that do this and erase such links from the HTML captures. Also CCBot checks the site's robots.txt before accessing any HTML page on that site but does not remove links from WARC (and WAT) captures if the link would be disallowed by the target site's robots.txt.

In other words, there are several reasons why fetching a particular image might be disallowed by robots.txt, while fetching the HTML pages linking to the image was allowed:

  1. images may be disallowed by robots.txt while HTML pages are not, e.g., by rules such as

    Disallow: /media/
    Disallow: /images/
    Disallow: *.gif$
  2. the image link was found in an HTML page on another site (the robots.txt of the site where the image is hosted may disallow fetching the image)

  3. different user-agents are used when crawling the HTML and later when fetching the images

  4. time gap: the robots.txt may change between accessing the HTML page and the image
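To make the first case concrete, here is a minimal sketch (not img2dataset code) using Python's stdlib robots.txt parser; the user-agent token "somebot" and the URLs are placeholders. With a ruleset like the one above, an HTML page may be fetched while an image under /images/ may not (note that, to my knowledge, the stdlib parser handles prefix rules like these but not wildcard patterns such as *.gif$):

    from urllib.robotparser import RobotFileParser

    # Ruleset of the kind described in point 1: images disallowed, pages allowed.
    rules = [
        "User-agent: *",
        "Disallow: /images/",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # The HTML page is allowed, the image under /images/ is not:
    print(rp.can_fetch("somebot", "https://example.com/articles/page.html"))  # True
    print(rp.can_fetch("somebot", "https://example.com/images/photo.jpg"))    # False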

Happy to discuss these points – maybe it's worth reopening this issue or opening a new one. Thanks!

(see also #249, which notes: "… check "robots.txt", as it is out of scope, and should probably be implemented in a way that allows caching to improve performance, and avoid multiple calls to "/robots.txt" per website.")

rom1504 commented 1 year ago

Makes sense. This needs to be addressed in a previous filtering step before using img2dataset, then.

Feel free to implement such a tool.

rom1504 commented 1 year ago

Or can you think of a way this could be implemented efficiently in img2dataset's current architecture? I can't see how this could be done without making two calls for each image. (Assuming it's even possible to find the robots.txt location.)

sebastian-nagel commented 1 year ago

in a previous filtering step before using img2dataset

I don't think it's practical, given the maximum cache duration required by RFC 9309, section 2.4: "Crawlers SHOULD NOT use the cached version for more than 24 hours".

Another point would be the requirement for crawlers to "set their own name" (user-agent product token) and send it along with the HTTP request, see section 2.2.1.
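As a small illustration of that point (a sketch only; the product token and URLs are placeholders, not what img2dataset actually sends), sending a self-identifying User-Agent with each request could look like:

    import urllib.request

    # Placeholder product token; a real crawler would set its own name here.
    req = urllib.request.Request(
        "https://example.com/images/photo.jpg",
        headers={"User-Agent": "somebot/1.0 (+https://example.com/bot-info)"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        image_bytes = resp.read()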

Or can you think of a way this could be implemented in an efficient way for img2dataset current architecture?

I see two ways to go:

  1. a central robots.txt cache (Redis?) - somewhat similar in functionality to the recommended "caching" DNS resolver
  2. partition the input so that all URLs from a single host end up in a single partition (or very few partitions for hosts with many URLs)
    • e.g. by hash(hostname) % num_partitions
    • this allows the robots.txt rules to be cached in the worker process itself, since a single partition does not contain too many different hostnames
    • this should also reduce the load on the DNS resolver
    • this approach is implemented by Nutch and StormCrawler
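A rough sketch of option 2, assuming the input is a plain list of URLs; the partition count and the md5-based hash are illustrative choices (a stable hash is used because Python's built-in hash() is randomized per process):

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlsplit

    def partition_id(url: str, num_partitions: int) -> int:
        # Stable hash of the hostname so that all URLs of a host share a partition.
        host = urlsplit(url).netloc.lower().encode("utf-8")
        return int.from_bytes(hashlib.md5(host).digest()[:8], "big") % num_partitions

    def partition(urls, num_partitions=64):
        buckets = defaultdict(list)
        for url in urls:
            buckets[partition_id(url, num_partitions)].append(url)
        # Each worker then handles one bucket and can cache robots.txt per host.
        return buckets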

(Assuming it's even possible to find robots.txt location)

The robots.txt location is defined in the RFC (it is /robots.txt), including the expected behavior if the location is redirected, or in case of a response code other than HTTP 200.
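For illustration, deriving that location from an arbitrary image URL is a one-liner with the stdlib (redirects and non-200 responses still need to be handled as the RFC describes):

    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_url(url: str) -> str:
        # Always /robots.txt at the root of the URL's authority.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    print(robots_txt_url("https://example.com/media/cat.jpg?size=large"))
    # https://example.com/robots.txt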

rom1504 commented 1 year ago

2 is not possible for two reasons:

1 may be possible, but we'd need to find good ways to automatically deploy a KV solution to keep the tool easy to use. There are other reasons that would justify using such a KV store per domain (e.g. rate limiting per domain), so that might be something interesting to investigate.

samjsharpe commented 1 year ago

Parsing robots.txt is a solved problem in Python; it's in the stdlib: https://docs.python.org/3/library/urllib.robotparser.html

Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?
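A minimal sketch of that "once per worker" idea: cache one parsed robots.txt per host inside the worker process and re-fetch after 24 hours (RFC 9309, section 2.4). This is not img2dataset code; the agent token "somebot" is a placeholder, and real code would need error handling around the network fetch.

    import time
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    _CACHE: dict[str, tuple[float, RobotFileParser]] = {}
    _MAX_AGE_SECONDS = 24 * 3600  # per RFC 9309, don't cache longer than 24 hours

    def _robots_for(url: str) -> RobotFileParser:
        parts = urlsplit(url)
        cached = _CACHE.get(parts.netloc)
        if cached is not None and time.time() - cached[0] < _MAX_AGE_SECONDS:
            return cached[1]
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()  # fetches and parses /robots.txt for this host
        _CACHE[parts.netloc] = (time.time(), rp)
        return rp

    def allowed(url: str, agent: str = "somebot") -> bool:
        return _robots_for(url).can_fetch(agent, url)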

rom1504 commented 1 year ago

Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?

Feel free to give it a try

robrwo commented 11 months ago

robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )

this tool is meant to be used after a crawler has been run, on the resulting validated urls

No. A lot of sites were indexed by Common Crawl before their index was used to train AIs. They have since opted out, but their pages live on in old copies of the index.

A website I maintain is regularly hit by img2dataset bots, even though the site now disallows and even blocks Common Crawl. The site sends the "noai" X-Robots-Tag, but this is a waste of CPU and bandwidth. It makes more sense to add something to robots.txt so that these crawlers just stay away.