gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.2k stars 1.76k forks source link

Data race #36

Closed papa-stiflera closed 7 years ago

papa-stiflera commented 7 years ago

Example code:

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Println(link)
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.Visit("https://en.wikipedia.org/")
}

Execution log:

==================
WARNING: DATA RACE
Write at 0x00c420083950 by main goroutine:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:598 +0x0
  net/textproto.MIMEHeader.Set()
      /usr/local/go/src/net/textproto/header.go:22 +0x60
  net/http.Header.Set()
      /usr/local/go/src/net/http/header.go:31 +0x60
  net/http.(*Request).AddCookie()
      /usr/local/go/src/net/http/request.go:385 +0x37c
  net/http.(*Client).send()
      /usr/local/go/src/net/http/client.go:170 +0x115
  net/http.(*Client).Do()
      /usr/local/go/src/net/http/client.go:602 +0x513
  crawler/vendor/github.com/asciimoo/colly.(*httpBackend).Do()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/http_backend.go:154 +0x105
  crawler/vendor/github.com/asciimoo/colly.(*httpBackend).Cache()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/http_backend.go:110 +0x9e
  crawler/vendor/github.com/asciimoo/colly.(*Collector).scrape()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/colly.go:226 +0x461
  crawler/vendor/github.com/asciimoo/colly.(*Collector).Visit()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/colly.go:157 +0x9b
  main.main()
      /home/skruglov/Projects/go/src/crawler/main.go:19 +0xe4

Previous read at 0x00c420083950 by goroutine 24:
  runtime.mapaccess1_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:208 +0x0
  net/http.http2isConnectionCloseRequest()
      /usr/local/go/src/net/http/h2_bundle.go:8652 +0xae
  net/http.(*http2clientConnReadLoop).endStreamError()
      /usr/local/go/src/net/http/h2_bundle.go:8288 +0xe2
  net/http.(*http2clientConnReadLoop).endStream()
      /usr/local/go/src/net/http/h2_bundle.go:8277 +0x54
  net/http.(*http2clientConnReadLoop).processData()
      /usr/local/go/src/net/http/h2_bundle.go:8267 +0x1ce
  net/http.(*http2clientConnReadLoop).run()
      /usr/local/go/src/net/http/h2_bundle.go:7896 +0x737
  net/http.(*http2ClientConn).readLoop()
      /usr/local/go/src/net/http/h2_bundle.go:7788 +0x11c

Goroutine 24 (running) created at:
  net/http.(*http2Transport).newClientConn()
      /usr/local/go/src/net/http/h2_bundle.go:7053 +0xe1a
  net/http.(*http2Transport).NewClientConn()
      /usr/local/go/src/net/http/h2_bundle.go:6991 +0x55
  net/http.(*http2addConnCall).run()
      /usr/local/go/src/net/http/h2_bundle.go:835 +0x55
==================

#mw-head
#p-search
/wiki/Wikipedia
...
asciimoo commented 7 years ago

I can't reproduce the bug, the above code runs without errors. What is your environment (os/go version/colly version)?

papa-stiflera commented 7 years ago
$ go version
go version go1.9.1 linux/amd64

colly version: d7069d1f89470f36a4176fd962fa3e40a838cb9b (master)

asciimoo commented 7 years ago

hmm.. I can reproduce if I add the -race flag to go run command, but I don't really understand why this happens.

kataras commented 7 years ago

@papa-stiflera you didn't paste the whole code you're using in order to be able to re-produce this. At the end of your logs: /home/skruglov/Projects/go/src/crawler/main.go:19 +0xe4 but the code you gave us is not too long( maybe of the import statements? I don't know but...) so I assume your data race is somewhere else and has nothing to do with the net/http package or the colly one. Please give us the necessary information to help you, thank you!

My output (win 10 x64, go 1.9.1);

colly_issue_36_output

asciimoo commented 7 years ago

@kataras thanks for the debugging. My first assumption was the same, then I tried the above code, and the bug is reproducible - even with the below snippet:

package main

import (
    "github.com/asciimoo/colly"
)

func main() {
    colly.NewCollector().Visit("https://en.wikipedia.org/")
}

The strange thing is that if I change the URL to a service which doesn't support HTTP/2 (e.g. to https://httpbin.org/) the race disappears.

UPDATE: It's pretty sure that the bug is somehow connected to the HTTP/2 support; This command doesn't fail: GODEBUG='http2client=0' go run -race t/test_36.go and this fails: GODEBUG='http2client=1' go run -race t/test_36.go

kataras commented 7 years ago

@asciimoo It's funny because the data racer couldn't find that with the first try, I had to re-run the program more than 4 times to view the race log...

Update:

I managed to "fix" that by locking when client.Do, see below and test it by yourself, if that works just put locks there;

colly_changes

asciimoo commented 7 years ago

@kataras unfortunately your suggested solution is not applicable, because it forbids parallelism in httpBackend and the error doesn't disappear for me if I run GODEBUG='http2client=1' go run -race xy.go.

kataras commented 7 years ago

I know @asciimoo ...I suggested it as a temporary solution, I don't know the whole code base so I can't help any further for now, but if it's a net/http issue then you have to fill an issue there :/

asciimoo commented 7 years ago

The bug is reproducible without colly. The following code demonstrates the problem:

package main

import (
    "net/http"
    "net/http/cookiejar"
)

func main() {
    jar, _ := cookiejar.New(nil)
    client := &http.Client{Jar: jar}
    client.CheckRedirect = func(req *http.Request, via []*http.Request) error {
        lastRequest := via[len(via)-1]
        req.Header = lastRequest.Header
        return nil
    }
    client.Get("https://en.wikipedia.org/")
}

Seems, the bug only appears if the client has a cookie jar and a custom redirect handler which writes to http.Request.Header using HTTP/2 protocol.