gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
Proxies are not rotated #759

Open regnull opened 1 year ago

regnull commented 1 year ago

I'm using an array of HTTP proxies and setting up the collector as described in the example:

c := colly.NewCollector(
    colly.MaxDepth(cfg.MaxDepth),
    colly.URLFilters(
        // ...
    ),
)

c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: cfg.Parallelism,
    Delay:       time.Duration(cfg.RandomDelay) * time.Millisecond,
})

roundRobinSwitcher, err := proxy.RoundRobinProxySwitcher(cfg.Proxy...)
if err != nil {
    log.Fatal().Err(err).Msg("failed to create proxy switcher")
}
c.SetProxyFunc(roundRobinSwitcher)

However, I've noticed that only the first proxy is ever used. I verified this by setting a breakpoint in the roundRobinSwitcher's GetProxy() function: it is hit only once.

I've traced the problem here: https://cs.opensource.google/go/go/+/refs/tags/go1.19.3:src/net/http/transport.go;l=539

    if altRT := t.alternateRoundTripper(req); altRT != nil {
        if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
            return resp, err
        }
        var err error
        req, err = rewindBody(req)
        if err != nil {
            return nil, err
        }
    }

On the first pass, execution skips the body of the if, continues, and eventually reaches the GetProxy function. On the second pass, alternateRoundTripper returns a non-nil round tripper, so execution enters the if and returns from there, which means GetProxy is not called again.

Unfortunately, this is where I reached the limit of my knowledge, so I didn't research further. Perhaps someone on the team knows what this is about.

Great library, btw, thanks for your work!

POFK commented 1 year ago

I ran into the same problem and found the solution in #399 and #567. Set DisableKeepAlives to true to make sure that the ProxyFunc is called on every request:

c.WithTransport(&http.Transport{
    DisableKeepAlives: true,
})
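
A minimal sketch of how the proxy function and DisableKeepAlives might be combined in a single transport, using only the standard library. roundRobin and newRotatingTransport are hypothetical helpers and the proxy addresses used with them are placeholders; roundRobin stands in for proxy.RoundRobinProxySwitcher and is not colly's implementation. It assumes that c.WithTransport replaces the collector's transport, in which case a proxy function set earlier with c.SetProxyFunc may need to be set on the new transport as well.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobin returns a function usable as http.Transport.Proxy that cycles
// through the given proxy URLs in order. Hypothetical helper, shown only to
// keep the example self-contained.
func roundRobin(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, fmt.Errorf("no proxy URLs given")
	}
	urls := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	var next uint32
	return func(*http.Request) (*url.URL, error) {
		// The atomic counter keeps the rotation safe under parallel requests.
		i := atomic.AddUint32(&next, 1) - 1
		return urls[i%uint32(len(urls))], nil
	}, nil
}

// newRotatingTransport builds a transport that carries both the rotating
// proxy function and DisableKeepAlives, so neither setting is lost when the
// transport is swapped in as a whole.
func newRotatingTransport(rawURLs ...string) (*http.Transport, error) {
	proxyFunc, err := roundRobin(rawURLs...)
	if err != nil {
		return nil, err
	}
	return &http.Transport{
		Proxy:             proxyFunc,
		DisableKeepAlives: true, // no connection reuse between requests
	}, nil
}
```

The resulting transport could then be passed to c.WithTransport(tr). Note that disabling keep-alives opens a new connection for every request, so some extra latency per request is to be expected.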