gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

User-agent switching doesn't work with Proxy #383

Open soanni opened 4 years ago

soanni commented 4 years ago

Hi there, thank you for your amazing work, it's really a great framework! I've been scratching my head for several days but can't understand what's wrong. I'm using an HTTP forward proxy (Squid) together with User-Agent switching from the Colly extensions, but in the Squid logs I can see that the User-Agent header is the default Golang-user-agent-1.1. However, in the OnRequest hook I can see that User-Agent switching does happen; moreover, the User-Agent is still the custom one even inside the Do() method in http_backend.go.

func (h *httpBackend) Do(request *http.Request, bodySize int) (*Response, error) {
    fmt.Println(request) // logged to check that the User-Agent is the custom one
    res, err := h.Client.Do(request)

But on the Squid side I see 'Golang-user-agent-1.1' for every request. I suspect that something happens to the User-Agent header on the net/http side, particularly when a proxy is used. The code is below (I also tried the ProxySwitcher extension, but still no luck).

    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
        colly.Async(true),
    )
    c.WithTransport(&http.Transport{
        Proxy: func(pr *http.Request) (*url.URL, error) {
            parsedU, err := url.Parse(viper.GetString("squid"))
            if err != nil {
                return nil, err
            }
            return parsedU, nil
        },
        DisableKeepAlives: true,
    })

    extensions.RandomUserAgent(c)

    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL)
        log.Println("UserAgent", r.Headers.Get("User-Agent"))
    })
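
One possible explanation, not confirmed anywhere in this thread: when the target URL is HTTPS, net/http opens a tunnel through the proxy with its own CONNECT request, and the headers set on the collector's request travel inside the encrypted tunnel where the proxy cannot log them. The CONNECT request carries Go's default User-Agent unless http.Transport.ProxyConnectHeader is set. The sketch below uses only the standard library; the local httptest server stands in for Squid, and connectUserAgent, example.invalid, per-request-agent and my-crawler/1.0 are illustrative names, not part of Colly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// connectUserAgent sends one HTTPS request through a throwaway local proxy
// and returns the User-Agent that the proxy saw on the CONNECT request.
// If ua is empty, ProxyConnectHeader is left unset.
func connectUserAgent(ua string) string {
	seen := make(chan string, 1)
	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodConnect {
			select {
			case seen <- r.Header.Get("User-Agent"):
			default:
			}
		}
		// Refuse the tunnel: only the CONNECT headers matter for this check.
		http.Error(w, "tunnel refused", http.StatusBadGateway)
	}))
	defer proxy.Close()

	proxyURL, err := url.Parse(proxy.URL)
	if err != nil {
		panic(err)
	}
	tr := &http.Transport{Proxy: http.ProxyURL(proxyURL), DisableKeepAlives: true}
	if ua != "" {
		// Headers sent to the proxy itself on CONNECT (assumed fix, untested with Squid).
		tr.ProxyConnectHeader = http.Header{"User-Agent": []string{ua}}
	}
	client := &http.Client{Transport: tr}

	req, err := http.NewRequest(http.MethodGet, "https://example.invalid/", nil)
	if err != nil {
		panic(err)
	}
	// This header belongs to the tunnelled request, so the proxy never sees it.
	req.Header.Set("User-Agent", "per-request-agent")
	if resp, err := client.Do(req); err == nil {
		resp.Body.Close()
	}

	select {
	case got := <-seen:
		return got
	default:
		return ""
	}
}

func main() {
	fmt.Println("without ProxyConnectHeader:", connectUserAgent(""))
	fmt.Println("with ProxyConnectHeader:   ", connectUserAgent("my-crawler/1.0"))
}
```

If this is indeed the cause, setting ProxyConnectHeader on the transport passed to c.WithTransport would change what Squid logs for HTTPS targets, although a single static value there would not rotate per request the way extensions.RandomUserAgent does.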

soanni commented 4 years ago

Is the project dead?

asciimoo commented 4 years ago

Hmm.. Interesting, do you get the same result if you set the proxy with Collector.SetProxy()?

makelove commented 4 years ago

Hmm.. Interesting, do you get the same result if you set the proxy with Collector.SetProxy()?

yes

package main
import (
    "fmt"
    "log"
    "net/http"
    "github.com/gocolly/colly"
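)

// proxy_list holds the proxy URLs to cycle through; it was not defined in the
// original snippet, so fill in your own values.
var proxy_list = []string{}
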
func main() {
    url := "https://httpbin.org/ip"
    c := colly.NewCollector(
        colly.AllowURLRevisit(),
        colly.Async(true),
    )
    c.UserAgent = "curl/7.54.0"
    c.WithTransport(&http.Transport{ 
        DisableKeepAlives: true, 
    })

    c.OnRequest(func(r *colly.Request) {
        proxy := r.Ctx.Get("proxy")
        c.SetProxy(proxy) // does not work when colly.Async(true) is set
        log.Println("OnRequest proxy:", proxy)
    })
    c.OnResponse(func(r *colly.Response) {
        log.Println("OnResponse")
        // log.Println("r.Request.ProxyURL", r.Request.ProxyURL) 
        // log.Println("OnResponse Visited", r.Request.URL)

        log.Println(string(r.Body[:]))
        proxy := r.Ctx.Get("proxy") // always the same value: the last proxy
        fmt.Println("OnResponse proxy:", proxy)
        fmt.Println("------------")
    })
    c.OnError(func(r *colly.Response, err error) {
        log.Println("OnError ", r.StatusCode, err)
        proxy := r.Ctx.Get("proxy")
        fmt.Println("OnError proxy:", proxy)

        fmt.Println("------------")
    })

    for idx, proxy := range proxy_list {
        fmt.Println(idx, proxy)
        var ctx = colly.NewContext()
        ctx.Put("proxy", proxy)
        c.Request("GET", url, nil, ctx, nil) 
    }
    c.Wait()
}

makelove commented 4 years ago

I think there is a problem with the design of Colly.

Why can't a proxy be set on each individual request, as in Scrapy? That is very easy to use.

asciimoo commented 4 years ago

@makelove good idea, would you like to work on it?

littlecluster commented 3 years ago

User-Agent and proxy switching work fine for me with the setup below. I did have some trouble getting this working, though: I cannot get proxy rotation to work without DisableKeepAlives: true. Would it be worth updating the documentation for this?

type httpBin struct {
    Headers struct {
        UserAgent string `json:"User-Agent"`
    } `json:"headers"`
    Origin string `json:"origin"`
}

func main() {
    // Instantiate the collector
    c := colly.NewCollector(

        // apply collector settings
        colly.AllowURLRevisit(),
        colly.Async(true), // testing async settings
    )

    // add random user agent extension
    extensions.RandomUserAgent(c)

    // load proxies into round robin switcher
    rp, err := proxy.RoundRobinProxySwitcher(proxies.GetAll()...) // list of proxy strings
    if err != nil {
        log.Fatal(err)
    }

    // if using async then disable transport keep alives
    c.WithTransport(&http.Transport{
        Proxy:             rp,
        DisableKeepAlives: true, // must be true
    })

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        obj := httpBin{}
        err := json.Unmarshal(r.Body, &obj)
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("%s: %s\n", obj.Origin, obj.Headers.UserAgent)
    })

    // create a request queue with 2 consumer threads
    q, _ := queue.New(
        2, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
    )

    for i := 0; i < 100; i++ {
        // Add URLs to the queue
        q.AddURL("https://httpbin.org/get")
    }
    // Consume URLs
    q.Run(c)

    // wait for the async requests to finish
    c.Wait()
}