gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

LimitRule is not limiting anything #575

Closed ranisalt closed 2 years ago

ranisalt commented 3 years ago

I am scraping thousands of pages from a website, which is painfully slow to do in sequence. So, I'm trying to use async mode. However, if I fire too many requests at once (and it's not that many), the service simply crashes and takes a while to restart. So I referred to the rate limit documentation. Unfortunately, it does nothing at all.

Considering the following code:

c := colly.NewCollector(colly.Async(true))

if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4}); err != nil {
    log.Fatal(err)
}

c.OnError(func(r *colly.Response, e error) {
    log.Print(e)
})

c.OnRequest(func(r *colly.Request) {
    log.Printf("Visiting %s", r.URL.String())
})

c.OnHTML("tr.standings-table__row", func(e *colly.HTMLElement) {
    // ...
})

for _, url := range urls {
    c.Visit(url)
}
c.Wait()
return

I expected at most 4 requests to fire at once; however, I see Visiting <url> for all URLs almost instantly, so it is clearly firing every request at once. This is confirmed by checking that the server I'm visiting is indeed down.

It does not work even if I set Parallelism to 1; the outcome is exactly the same. And before anyone asks: no, other values of DomainGlob, such as *domain.com*, don't work either.

WGH- commented 3 years ago

Can't reproduce. Although it prints OnRequest for all URLs at once, requests on the wire are properly limited (you can verify it with tcpdump).

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(colly.Async(true))
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})
    c.OnRequest(func(r *colly.Request) {
        log.Printf("OnRequest %s", r.URL.String())
    })
    c.OnResponse(func(r *colly.Response) {
        log.Printf("OnResponse %s", r.Request.URL.String())
    })
    for i := 0; i < 8; i++ {
        c.Visit(fmt.Sprintf("http://httpbin.org/delay/2?%d", i))
    }
    c.Wait()
}
ranisalt commented 3 years ago

Can't reproduce. Although it prints OnRequest for all URLs at once, requests on the wire are properly limited (you can verify it with tcpdump).

I will try that again; it's been a while. I was accidentally DoSing the website every time :laughing:

WGH- commented 3 years ago

You might want to add Delay as well, though, because otherwise the request rate is limited only by the server's response rate. Maybe that was the reason for your DoS to begin with?

roeniss commented 1 year ago

Although it prints OnRequest for all URLs at once, requests on wire are properly limited

Looks like a very important point.