gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Upgrade to Websocket breaks Request Timeout #355

Open derricw opened 5 years ago

derricw commented 5 years ago

When a server answers a plain GET with 101 Switching Protocols (an upgrade to WebSocket), colly hangs indefinitely regardless of timeout settings.

Here is an example program and domain that demonstrates this problem:

package main

import (
        "fmt"
        "github.com/gocolly/colly"
        "time"
)

func main() {
        // Instantiate default collector
        c := colly.NewCollector()
        c.SetRequestTimeout(5 * time.Second)

        // Before making a request print "Visiting ..."
        c.OnRequest(func(r *colly.Request) {
                fmt.Println("Visiting", r.URL.String())
        })

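        // Visit blocks until the request completes; report any error it returns
        if err := c.Visit("http://www.cccamgroup.com"); err != nil {
                fmt.Println("Visit failed:", err)
        }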
}

Here is the output from curl -L -v http://www.cccamgroup.com:

* Expire in 50 ms for 1 (transfer 0x560676cdf5c0)
* Expire in 50 ms for 1 (transfer 0x560676cdf5c0)
*   Trying 198.24.160.236...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x560676cdf5c0)
* Connected to www.cccamgroup.com (198.24.160.236) port 80 (#0)
> GET / HTTP/1.1
> Host: www.cccamgroup.com
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 101 Switching Protocols
< Connection: Upgrade
< Upgrade: websocket
< Sec-WebSocket-Version: 13
< Sec-WebSocket-Accept: Kfh9QIsMVZcl6xEPYxPHzW8SZ8w=
< WebSocket-Server: VaughnSoft Chat
< 
derricw commented 5 years ago

It looks like the hang happens at ioutil.ReadAll when the response body is read. The problem goes away if the read runs in a goroutine and the caller gives up after a timeout:

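    // Read in a goroutine and pass the result over a channel, so the
    // goroutine never writes to variables the caller may be using.
    type readResult struct {
        body []byte
        err  error
    }
    rch := make(chan readResult, 1) // buffered so the send never blocks
    go func() {
        b, rerr := ioutil.ReadAll(bodyReader)
        rch <- readResult{b, rerr}
    }()
    var body []byte
    select {
    case res := <-rch:
        body, err = res.body, res.err
    case <-time.After(30 * time.Second):
        err = errors.New("read deadline reached")
    }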