gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Error trying to conditionally set up proxy function #806

Closed sstehniy closed 9 months ago

sstehniy commented 9 months ago

I'm working on a web crawler using the Colly library in Go, and I've encountered an issue with setting up a proxy function dynamically when the collector receives a "429 Too Many Requests" response. The goal is to enable a proxy to bypass the rate limit and then disable it after 30 seconds. However, when I try to set up the proxy switcher in response to a 429 error, I get an error: "Failed to create proxy switcher: Too Many Requests".

Note: everything works fine when I set the proxy right after creating the Collector, before the first collector.Visit call is made.

Collector configuration:

    c := colly.NewCollector(colly.Async(true), colly.AllowedDomains(mainHost))

    // Limit the number of concurrent requests to 10
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 10,
        Delay:       100 * time.Millisecond,
        RandomDelay: 100 * time.Millisecond,
    })

Here's the relevant part of my code:

    c.OnError(func(r *colly.Response, err error) {
        if r.StatusCode == 429 {
            log.Println("Received 429 Too Many Requests, enabling proxy")

            proxyList := getProxyList("./proxies.txt")
            rp, err := proxy.RoundRobinProxySwitcher(proxyList...)

            if err != nil {
                log.Fatal("Failed to create proxy switcher:", err)
            }

            c.SetProxyFunc(rp) // <--------- here the error occurs
            globalMutex.Lock()
            if proxyEnabled {
                disableProxyChan <- true
                proxyEnabled = false
                log.Println("Proxy disabled due to 429")
            } else {
                proxyEnabled = true
                globalMutex.Unlock()
                go manageProxy(c)
                log.Println("Proxy switcher created")
            }

            c.Visit(r.Request.URL.String())
        }
    })

And the manageProxy function:

    func manageProxy(c *colly.Collector) {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()

        for {
            select {
            case <-ticker.C:
                globalMutex.Lock()
                c.SetProxyFunc(nil)
                proxyEnabled = false
                log.Println("Proxy disabled due to timeout")
                globalMutex.Unlock()
            case <-disableProxyChan:
                return
            }
        }
    }

I expected the proxy to be enabled upon receiving a 429 error and then automatically disabled after 30 seconds. However, the error "Failed to create proxy switcher: Too Many Requests" is thrown immediately when trying to create the proxy switcher, preventing the proxy from being set up.

Has anyone encountered a similar issue or can provide insight into what might be going wrong here? Any help or suggestions would be greatly appreciated.

sstehniy commented 9 months ago

I understand that I may be doing this wrong, but right now I can't think of another approach that would not involve heavily refactoring and extending my crawler logic, so I would like to know what is wrong here before I start rewriting everything. Thanks.