gocolly / colly

Elegant Scraper and Crawler Framework for Golang
Apache License 2.0
23.07k stars, 1.76k forks

User-agent switching doesn't work with Proxy #383

Open soanni opened 4 years ago

soanni commented 4 years ago

Hi there, thank you for your amazing work, it's really a great framework! I've been scratching my head for several days but can't figure out what's wrong. I'm using an HTTP forward proxy (Squid) together with the User-Agent switching from the Colly extensions, but in the Squid logs I can see that the User-Agent header is the default Golang-user-agent-1.1. In the OnRequest hook, however, I can see that the User-Agent switching does happen; moreover, the User-Agent is still the custom one even inside the Do() method in http_backend.go.

func (h *httpBackend) Do(request *http.Request, bodySize int) (*Response, error) {
    fmt.Println(request) // I'm logging to check the User-Agent is custom
    res, err := h.Client.Do(request)

But on the Squid side I see 'Golang-user-agent-1.1' for every request. I suspect that something happens to the User-Agent header on the net/http side, particularly when a proxy is used. The code is below (I also tried the ProxySwitcher extension, but still no luck).

    c := colly.NewCollector()
    c.WithTransport(&http.Transport{
        // Send every request through the Squid proxy named in the config.
        Proxy: func(pr *http.Request) (*url.URL, error) {
            parsedU, err := url.Parse(viper.GetString("squid"))
            if err != nil {
                return nil, err
            }
            return parsedU, nil
        },
        DisableKeepAlives: true,
    })

    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL)
        log.Println("UserAgent", r.Headers.Get("User-Agent"))
    })

soanni commented 4 years ago

Is this project dead?

asciimoo commented 4 years ago

Hmm.. Interesting, do you get the same result if you set the proxy with Collector.SetProxy()?

makelove commented 4 years ago

> Hmm.. Interesting, do you get the same result if you set the proxy with Collector.SetProxy()?


package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/gocolly/colly"
)

// proxy_list holds the proxy addresses to try; its contents were not part of the post.
var proxy_list []string

func main() {
    url := "https://httpbin.org/ip"
    c := colly.NewCollector()
    c.UserAgent = "curl/7.54.0"
    c.WithTransport(&http.Transport{
        DisableKeepAlives: true,
    })

    c.OnRequest(func(r *colly.Request) {
        proxy := r.Ctx.Get("proxy")
        c.SetProxy(proxy) // not working with colly.Async(true)
        log.Println("OnRequest proxy:", proxy)
    })
    c.OnResponse(func(r *colly.Response) {
        // log.Println("r.Request.ProxyURL", r.Request.ProxyURL)
        // log.Println("OnResponse Visited", r.Request.URL)

        proxy := r.Ctx.Get("proxy") // always the same one: the last proxy
        fmt.Println("OnResponse proxy:", proxy)
    })
    c.OnError(func(r *colly.Response, err error) {
        log.Println("OnError ", r.StatusCode, err)
        proxy := r.Ctx.Get("proxy")
        fmt.Println("OnError proxy:", proxy)
    })

    // Issue one request per proxy, passing the proxy address in the request context.
    for idx, proxy := range proxy_list {
        fmt.Println(idx, proxy)
        var ctx = colly.NewContext()
        ctx.Put("proxy", proxy)
        c.Request("GET", url, nil, ctx, nil)
    }
}

makelove commented 4 years ago

I think there is a problem with colly's design.

Why can't a proxy be set on each individual request, as in Scrapy? That is very easy to use.

asciimoo commented 4 years ago

@makelove good idea, would you like to work on it?

littlecluster commented 3 years ago

User-Agent and proxy switching are working fine for me with the setup below. I did have some trouble getting this working, though: I cannot get proxy rotation to work without DisableKeepAlives set to true. Would it be worth updating the documentation for this?

type httpBin struct {
    Headers struct {
        UserAgent string `json:"User-Agent"`
    } `json:"headers"`
    Origin string `json:"origin"`
}

func main() {
    // Instantiate the collector
    c := colly.NewCollector(
        // apply collector settings
        colly.Async(true), // testing async settings
    )

    // add random user agent extension
    extensions.RandomUserAgent(c)

    // load proxies into round robin switcher
    rp, err := proxy.RoundRobinProxySwitcher(proxies.GetAll()...) // list of proxy strings
    if err != nil {
        log.Fatal(err)
    }

    // if using async then disable transport keep alives
    c.WithTransport(&http.Transport{
        Proxy:             rp,
        DisableKeepAlives: true, // must be true
    })

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        obj := httpBin{}
        err := json.Unmarshal(r.Body, &obj)
        if err != nil {
            log.Println(err)
            return
        }

        fmt.Printf("%s: %s\n", obj.Origin, obj.Headers.UserAgent)
    })

    // create a request queue with 2 consumer threads
    q, _ := queue.New(
        2, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
    )

    for i := 0; i < 100; i++ {
        // Add URLs to the queue
    }
    // Consume URLs
    q.Run(c)

    // wait re async
    c.Wait()
}