scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Crawl-Delay support for robots.txt #892

Open · kmike opened this issue 9 years ago

kmike commented 9 years ago

The Crawl-Delay directive in robots.txt looks useful. If it is present, the delay suggested there looks like a good way to adjust the crawling rate. When crawling unknown domains it can work better than AutoThrottle.

A draft implementation: https://gist.github.com/kmike/76aca46cad18915b8695

Some notes:
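
For reference, a minimal sketch of the idea (independent of the gist above), assuming the robots.txt body has already been downloaded. It uses the standard library's urllib.robotparser, which understands Crawl-delay; the helper name and the default delay are made up for illustration.

```python
# A minimal sketch, not the gist above: read Crawl-delay from a robots.txt
# body and fall back to a default delay when no directive applies.
from urllib.robotparser import RobotFileParser


def suggested_delay(robots_txt_body, user_agent, default_delay=1.0):
    """Return the delay (in seconds) suggested by robots.txt, or a default."""
    parser = RobotFileParser()
    parser.parse(robots_txt_body.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default_delay


# A robots.txt asking all bots to wait 5 seconds between requests:
body = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
print(suggested_delay(body, "mybot"))  # 5.0
```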

kmike commented 9 years ago

I've tried to run it for two broad crawl problems (different countries, totally different sets of websites); in the first case ~11% of websites provided robots.txt files with Crawl-Delay directives; in the second case the number was ~13%.
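
For anyone who wants to run a similar survey, a rough sketch of the counting step (not the actual setup used above), assuming the robots.txt bodies have already been collected per website:

```python
# A rough sketch of the counting step: what fraction of the collected
# robots.txt bodies declare a Crawl-Delay directive at all.
import re

CRAWL_DELAY_RE = re.compile(r"^\s*crawl-delay\s*:", re.IGNORECASE | re.MULTILINE)


def crawl_delay_share(robots_bodies):
    """Fraction of robots.txt bodies that contain a Crawl-Delay directive."""
    bodies = list(robots_bodies)
    if not bodies:
        return 0.0
    with_delay = sum(1 for body in bodies if CRAWL_DELAY_RE.search(body))
    return with_delay / len(bodies)


print(crawl_delay_share([
    "User-agent: *\nCrawl-delay: 10\n",      # declares a delay
    "User-agent: *\nDisallow: /private/\n",  # does not
]))  # 0.5
```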

asciidiego commented 4 years ago

How's this issue going, @kmike? Is there a PR for it?

Also, what about Visit-time?

A reference for arcane robots.txt directives: https://www.ctrl.blog/entry/arcane-robotstxt-directives.html

asciidiego commented 4 years ago

Maybe we should reference the following:

  1. https://github.com/scrapy/scrapy/pull/3796
  2. https://github.com/scrapy/scrapy/issues/3969
kmike commented 4 years ago

Thanks for the links @diegovincent. There is no PR for this feature yet; contributions are welcome!

dannyeuu commented 4 years ago

This would be a nice feature.

asciidiego commented 4 years ago

@dannyeuu @kmike still nothing?

If that is the case, maybe @sergipastor, @sarafg11 or @psique want to take a look at this?

kmike commented 4 years ago

No updates @diegovincent.

fkromer commented 1 year ago

Scrapy uses Protego as its default robots.txt parser under the hood (the ROBOTSTXT_PARSER setting). According to the example in the Protego README, extracting the crawl delay declared in a robots.txt file is already supported. It seems, however, that the abstract base class for robots.txt parsers, class RobotParser(metaclass=ABCMeta), does not expose the crawl delay yet.
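
To illustrate, a small example of the Protego side that already works today; the robots.txt content and the user agent string are made up for illustration:

```python
# The Protego parser already exposes the crawl delay declared in robots.txt.
from protego import Protego

robots_txt = """
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = Protego.parse(robots_txt)
print(rp.can_fetch("https://example.com/private/page", "mybot"))  # False
print(rp.crawl_delay("mybot"))  # 10.0
```

The missing piece would presumably be a corresponding accessor on the RobotParser interface and something on the Scrapy side (for example the robots.txt middleware or the download slots) that applies the value per domain; that part does not exist yet.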