gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Proxy Authentication #343

Open andystroz opened 5 years ago

andystroz commented 5 years ago

I was wondering how to use the proxy functionality with Basic Auth. I've modified the callback function to insert the base64-encoded username and password in the Proxy-Authorization header. I have also verified that the same HTTP request to https://httpbin.org/ip works in cURL, but Colly returns Error: Proxy Authentication Required.

andystroz commented 5 years ago

Just saw this issue, and I think it's related, because http://httpbin.org/ip works but https://httpbin.org/ip does not. The website I am looking to scrape uses HTTPS.

suisun2015 commented 4 years ago

Hi Andrew, I hit the same error. Did you figure it out?

andystroz commented 4 years ago

Yeah! I did end up getting it working; hopefully this still helps.

c := colly.NewCollector()
c.WithTransport(s.HTTPTransport)

Where s.HTTPTransport is:

&http.Transport{
    Proxy: rp,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second,
        DualStack: true,
    }).DialContext,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
    TLSNextProto:          nil,
}

Where rp is a function that specifies a proxy to use.

// AuthenticatedRoundRobinProxyHTTP returns the proxy function used by http for an authenticated round robin proxy
func AuthenticatedRoundRobinProxyHTTP(proxyURLs []string, username string, password string) (func(*http.Request) (*url.URL, error), error) {
    roundRobinSwitcher, err := collyProxy.RoundRobinProxySwitcher(proxyURLs...)
    if err != nil {
        return nil, err
    }
    return (&authenticatedRoundRobinSwitcher{roundRobinSwitcher, username, password}).GetAuthenticatedProxy, nil
}

Where proxyURLs []string is a list of socks5 proxy URLs, e.g. socks5://127.0.0.1:1337, socks5://127.0.0.1:1338, and (&authenticatedRoundRobinSwitcher{roundRobinSwitcher, username, password}).GetAuthenticatedProxy is a method that takes request *http.Request, sets the proxy headers on it, and returns the URL chosen by the colly round robin switcher.

type authenticatedRoundRobinSwitcher struct {
    roundRobinSwitcher colly.ProxyFunc
    username           string
    password           string
}

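// GetAuthenticatedProxy sets the proxy authentication headers on the request
// and returns the next proxy URL from the round robin switcher.
func (r *authenticatedRoundRobinSwitcher) GetAuthenticatedProxy(request *http.Request) (*url.URL, error) {
    // Build the Basic credentials: base64("username:password").
    auth := r.username + ":" + r.password
    basicAuth := "Basic " + base64.StdEncoding.EncodeToString([]byte(auth))
    // Set (rather than Add) so repeated calls on the same request do not duplicate the headers.
    request.Header.Set("Proxy-Authorization", basicAuth)
    request.Header.Set("Proxy-Connection", "Keep-Alive")
    // Named proxyURL to avoid shadowing the net/url package.
    proxyURL, err := r.roundRobinSwitcher(request)
    if err != nil {
        return nil, err
    }
    return proxyURL, nil
}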

This was a doozy to figure out so lmk if you got any questions :)

hsinhoyeh commented 2 years ago

It seems we can specify the username/password in the proxy URL itself.

roundRobinSwitcher, err := collyProxy.RoundRobinProxySwitcher("socks5://username:password@127.0.0.1:1337")

and the http package will handle the authentication for us (for socks5 this is username/password negotiation rather than a header); see: https://github.com/golang/go/blob/master/src/net/http/transport.go#L1624

case cm.proxyURL.Scheme == "socks5":
        conn := pconn.conn
        d := socksNewDialer("tcp", conn.RemoteAddr().String())
        if u := cm.proxyURL.User; u != nil {
            auth := &socksUsernamePassword{
                Username: u.Username(),
            }
            auth.Password, _ = u.Password()
            d.AuthMethods = []socksAuthMethod{
                socksAuthMethodNotRequired,
                socksAuthMethodUsernamePassword,
            }
            d.Authenticate = auth.Authenticate
        }
        if _, err := d.DialWithConn(ctx, conn, "tcp", cm.targetAddr); err != nil {
            conn.Close()
            return nil, err
        }