scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Crawl-Delay support for robots.txt #892

Open · kmike opened this issue 9 years ago

kmike commented 9 years ago

The Crawl-Delay directive in robots.txt looks useful. If it is present, the delay suggested there looks like a good way to adjust the crawling rate. When crawling unknown domains it can work better than AutoThrottle.

A draft implementation: https://gist.github.com/kmike/76aca46cad18915b8695

Some notes:
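
For reference, a minimal sketch of the idea (independent of the gist above), assuming the robots.txt body has already been downloaded. It uses the standard library's urllib.robotparser, which understands Crawl-delay; the helper name and the default delay are made up for illustration.

```python
# A minimal sketch, not the gist above: read Crawl-delay from a robots.txt
# body and fall back to a default delay when no directive applies.
from urllib.robotparser import RobotFileParser


def suggested_delay(robots_txt_body, user_agent, default_delay=1.0):
    """Return the delay (in seconds) suggested by robots.txt, or a default."""
    parser = RobotFileParser()
    parser.parse(robots_txt_body.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default_delay


# A robots.txt asking all bots to wait 5 seconds between requests:
body = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
print(suggested_delay(body, "mybot"))  # 5.0
```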

kmike commented 9 years ago

I've tried to run it for two broad crawl problems (different countries, totally different sets of websites); in the first case ~11% of websites provided robots.txt files with Crawl-Delay directives; in the second case the number was ~13%.
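
For anyone who wants to run a similar survey, a rough sketch of the counting step (not the actual setup used above), assuming the robots.txt bodies have already been collected per website:

```python
# A rough sketch of the counting step: what fraction of the collected
# robots.txt bodies declare a Crawl-Delay directive at all.
import re

CRAWL_DELAY_RE = re.compile(r"^\s*crawl-delay\s*:", re.IGNORECASE | re.MULTILINE)


def crawl_delay_share(robots_bodies):
    """Fraction of robots.txt bodies that contain a Crawl-Delay directive."""
    bodies = list(robots_bodies)
    if not bodies:
        return 0.0
    with_delay = sum(1 for body in bodies if CRAWL_DELAY_RE.search(body))
    return with_delay / len(bodies)


print(crawl_delay_share([
    "User-agent: *\nCrawl-delay: 10\n",      # declares a delay
    "User-agent: *\nDisallow: /private/\n",  # does not
]))  # 0.5
```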

asciidiego commented 4 years ago

How's this issue going, @kmike? Is there a PR for it?

Also, what about Visit-time?

A reference for arcane robots.txt directives: https://www.ctrl.blog/entry/arcane-robotstxt-directives.html

asciidiego commented 4 years ago

Maybe we should reference the following:

  1. https://github.com/scrapy/scrapy/pull/3796
  2. https://github.com/scrapy/scrapy/issues/3969
kmike commented 4 years ago

Thanks for the links @diegovincent. There is no PR for this feature yet; contributions are welcome!

dannyeuu commented 4 years ago

This would be a nice feature.

asciidiego commented 4 years ago

@dannyeuu @kmike still nothing?

If that is the case, maybe @sergipastor, @sarafg11 or @psique want to take a look at this?

kmike commented 4 years ago

No updates @diegovincent.

fkromer commented 1 year ago

Scrapy uses Protego as its default robots.txt parser under the hood (the ROBOTSTXT_PARSER setting). According to the example in the Protego README, extracting the crawl delay declared in a robots.txt file is already supported. It seems, however, that the abstract base class for robots.txt parsers, class RobotParser(metaclass=ABCMeta), does not expose the crawl delay yet.
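
To illustrate, a small example of the Protego side that already works today; the robots.txt content and the user agent string are made up for illustration:

```python
# The Protego parser already exposes the crawl delay declared in robots.txt.
from protego import Protego

robots_txt = """
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = Protego.parse(robots_txt)
print(rp.can_fetch("https://example.com/private/page", "mybot"))  # False
print(rp.crawl_delay("mybot"))  # 10.0
```

The missing piece would presumably be a corresponding accessor on the RobotParser interface and something on the Scrapy side (for example the robots.txt middleware or the download slots) that applies the value per domain; that part does not exist yet.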