kmike opened this issue 9 years ago
I've tried to run it for two broad-crawl problems (different countries, totally different sets of websites); in the first case ~11% of websites provided robots.txt files with Crawl-Delay directives; in the second case the number was ~13%.
How's this issue going, @kmike? Is there a PR for it?
Also, what about Visit-time?

Maybe we should reference the following overview of arcane robots.txt directives: https://www.ctrl.blog/entry/arcane-robotstxt-directives.html
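For reference, the directives under discussion look like this in a robots.txt file (values are illustrative; the Visit-time syntax follows the old extended-standard draft described in the article above):

```
User-agent: *
Crawl-delay: 10
Visit-time: 0600-0845
```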
Thanks for the links, @diegovincent. There is no PR for this feature yet; contributions are welcome!
This would be a nice feature.
@dannyeuu @kmike still nothing?
If that is the case, maybe @sergipastor, @sarafg11 or @psique want to take a look at this?
No updates, @diegovincent.
Scrapy uses Protego as its default robots.txt parser under the hood (the ROBOTSTXT_PARSER setting). According to the example in the Protego README, extracting the crawl delay from a robots.txt file is already supported. It seems the abstract base class for robots.txt parsers, class RobotParser(metaclass=ABCMeta), does not support crawl delay yet.
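A minimal sketch of both halves of that observation, using Protego's documented parse/crawl_delay API; the robots.txt content and bot name are made up, and the crawl_delay method on the ABC is only a proposal, not existing Scrapy API:

```python
from abc import ABCMeta, abstractmethod

from protego import Protego

# What Protego already supports (illustrative robots.txt content):
robotstxt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = Protego.parse(robotstxt)
print(rp.can_fetch("https://example.com/index.html", "mybot"))  # True
print(rp.crawl_delay("mybot"))  # 5.0


# How the abstract base class could expose it; the real ABC also has an
# abstract from_crawler() classmethod, omitted here for brevity.
class RobotParser(metaclass=ABCMeta):
    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if the user agent is allowed to fetch the URL."""

    @abstractmethod
    def crawl_delay(self, user_agent):
        """Return the Crawl-Delay for the user agent, or None if absent."""
```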
The Crawl-Delay directive in robots.txt looks useful. If it is present, the delay suggested there looks like a good way to adjust the crawling rate; when crawling unknown domains it can work better than AutoThrottle.
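To illustrate the idea, a minimal standalone sketch; the helper name, default delay, and cap are assumptions of mine, not taken from the draft gist linked below:

```python
import urllib.request

from protego import Protego

DEFAULT_DELAY = 1.0  # fallback when robots.txt sets no Crawl-Delay
MAX_DELAY = 60.0     # ignore absurdly large Crawl-Delay values

def polite_delay(robots_url, user_agent="mybot"):
    """Derive a per-domain download delay from robots.txt."""
    body = urllib.request.urlopen(robots_url).read().decode("utf-8")
    rp = Protego.parse(body)
    delay = rp.crawl_delay(user_agent)
    if delay is None:
        return DEFAULT_DELAY
    return min(delay, MAX_DELAY)

print(polite_delay("https://example.com/robots.txt"))
```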
A draft implementation: https://gist.github.com/kmike/76aca46cad18915b8695
Some notes: