gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

How to respect crawl delay? #191

Closed mooooooooooooooooo closed 6 years ago

mooooooooooooooooo commented 6 years ago

I understand a RandomDelay was introduced recently. My c.Limit looks like:

```go
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 1000,
    Delay:       1 * time.Second,
})
```

(Correct me if I am wrong) this will cause a worker to sleep for x seconds before it crawls again. However, this is not the same as the crawl delay for that host. For instance, I have 1000 links to example.com, and example.com specifies a Crawl-delay of 5 seconds in its robots.txt; I would like to respect that. However, the way it is implemented, the workers will each sleep for 1 second and each will then get a link to example.com and crawl it. How do I respect their Crawl-delay directive?
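For context, a minimal sketch of the per-host pacing I am after, assuming colly does not read Crawl-delay itself, so the 5-second value would be copied from robots.txt by hand (example.com, the limit values and the rule ordering are only illustrative; I am assuming the more specific rule has to be registered first):

```go
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(colly.Async(true))

    // Serialize requests to example.com and wait 5 seconds between them,
    // mirroring the Crawl-delay from its robots.txt by hand.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*example.com*",
        Parallelism: 1,
        Delay:       5 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    // Looser rule for every other host.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 4,
        Delay:       1 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("visited", r.Request.URL)
    })

    c.Visit("http://example.com/")
    c.Wait()
}
```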

mooooooooooooooooo commented 6 years ago

This behaviour can be seen in the random_delay example: https://github.com/gocolly/colly/blob/da70a56a3ec59289f37933dc5e9158eaa4b28e78/_examples/random_delay/random_delay.go. Launch it and it will sleep initially, then fire 5 simultaneous requests at httpbin.org (not nice!).

mooooooooooooooooo commented 6 years ago

In other words, RandomDelay right now is kind of useless unless you are limited to 1 domain and 1 worker.

asciimoo commented 6 years ago

> (Correct me if I am wrong) this will cause a worker to sleep for x seconds before it crawls again.

Parallelism and delay cannot be combined; it makes no sense to create 1000 parallel requests and wait 1 second after every parallel request imho. If you really need this feature, you can use queues (https://godoc.org/github.com/gocolly/colly/queue) and sleep in the OnRequest callbacks.
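Something like this rough sketch (the single consumer thread, the in-memory storage size, the 5-second sleep and the httpbin URLs are just placeholders):

```go
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/queue"
)

func main() {
    c := colly.NewCollector()

    // Sleep before every request; the 5 seconds stand in for the
    // host's Crawl-delay.
    c.OnRequest(func(r *colly.Request) {
        time.Sleep(5 * time.Second)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("visited", r.Request.URL)
    })

    // In-memory queue drained by a single consumer thread, so requests
    // are never fired in parallel.
    q, err := queue.New(1, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        panic(err)
    }
    for i := 0; i < 5; i++ {
        q.AddURL(fmt.Sprintf("http://httpbin.org/anything/%d", i))
    }
    q.Run(c)
}
```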

> This behaviour can be seen in the random_delay example

Nope, it just calls the OnRequest callback when you call Visit, but the actual request is sent only once all matching LimitRule limits are satisfied. So there can be a delay between the OnRequest callback and the actual network traffic.
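A quick way to see this is to log timestamps in both callbacks, roughly like the sketch below (the httpbin URLs and the limit values just mimic the random_delay example): OnRequest fires almost immediately for every Visit, while the responses trickle in as the limit allows.

```go
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(colly.Async(true))

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*httpbin.*",
        Parallelism: 1,
        RandomDelay: 5 * time.Second,
    })

    // Fires as soon as the request is queued by Visit...
    c.OnRequest(func(r *colly.Request) {
        fmt.Println(time.Now().Format("15:04:05"), "OnRequest ", r.URL)
    })
    // ...while the response only arrives after the LimitRule has been applied.
    c.OnResponse(func(r *colly.Response) {
        fmt.Println(time.Now().Format("15:04:05"), "OnResponse", r.Request.URL)
    })

    for i := 0; i < 5; i++ {
        c.Visit(fmt.Sprintf("http://httpbin.org/anything/%d", i))
    }
    c.Wait()
}
```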

mooooooooooooooooo commented 6 years ago

> Parallelism and delay cannot be combined; it makes no sense to create 1000 parallel requests and wait 1 second after every parallel request imho.

The example I cited combines parallelism and a delay. It also doesn't have an OnRequest callback.

LeMoussel commented 6 years ago

If you want to respect the robots.txt Crawl-delay directive, a delay is necessary.

The Crawl-delay directive is an unofficial directive used to prevent overloading servers with too many requests. If search engines are able to overload a server, adding Crawl-delay to your robots.txt file is only a temporary fix. The fact of the matter is, your website is running on a poor hosting environment and you should fix that as soon as possible. https://en.m.wikipedia.org/wiki/Robots_exclusion_standard
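As far as I can tell colly does not apply Crawl-delay on its own, so one option is to read it from robots.txt yourself and feed it into the LimitRule Delay. A rough sketch, assuming the third-party github.com/temoto/robotstxt parser; crawlDelay is a hypothetical helper and the host/user agent values are placeholders:

```go
package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "time"

    "github.com/gocolly/colly"
    "github.com/temoto/robotstxt"
)

// crawlDelay fetches robots.txt for a host and returns the Crawl-delay
// advertised for the given user agent (0 if none is set or on any error).
func crawlDelay(host, agent string) time.Duration {
    resp, err := http.Get("http://" + host + "/robots.txt")
    if err != nil {
        return 0
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return 0
    }
    robots, err := robotstxt.FromBytes(body)
    if err != nil {
        return 0
    }
    return robots.FindGroup(agent).CrawlDelay
}

func main() {
    host := "example.com"
    delay := crawlDelay(host, "my-colly-bot")
    if delay == 0 {
        delay = 1 * time.Second // fallback when no Crawl-delay is advertised
    }

    c := colly.NewCollector(colly.UserAgent("my-colly-bot"))
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*" + host + "*",
        Parallelism: 1,
        Delay:       delay,
    }); err != nil {
        log.Fatal(err)
    }

    c.Visit("http://" + host + "/")
}
```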