gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.2k stars 1.76k forks

Weird async behaviour - duplicates in responses #802

Open AlexS778 opened 9 months ago

AlexS778 commented 9 months ago

Hello guys, I was recently using colly to crawl some pages and it was taking quite a long time, so I decided to switch to async mode. In async mode I noticed a lot of duplicates in my results; in particular, the number of duplicates matched the number of threads I was launching the crawler with.

Here is a quick example, based on one from the official docs: https://github.com/gocolly/colly/blob/master/_examples/rate_limit/rate_limit.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    url := "https://httpbin.org/delay/2"

    // Instantiate default collector
    c := colly.NewCollector(
        // Turn on asynchronous requests
        colly.Async(true),
    )

    // Start scraping in five threads on https://httpbin.org/delay/2
    for i := 0; i < 5; i++ {
        // Print the body of every response
        c.OnResponse(func(response *colly.Response) {
            fmt.Println(string(response.Body))
        })

        c.Visit(fmt.Sprintf("%s?n=%d", url, i))
    }
    // Wait until threads are finished
    c.Wait()
}

If we run this code, we see the following results:

A lot of text here with http body response

```json
{
  "args": { "n": "3" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=3"
}
{
  "args": { "n": "3" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=3"
}
{
  "args": { "n": "3" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=3"
}
{
  "args": { "n": "1" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=1"
}
{
  "args": { "n": "1" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=1"
}
{
  "args": { "n": "1" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=1"
}
{
  "args": { "n": "1" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=1"
}
{
  "args": { "n": "1" },
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "User-Agent": "colly - https://github.com/gocolly/colly/v2",
    "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
  },
  "origin": "83.139.137.160",
  "url": "https://httpbin.org/delay/2?n=1"
}
```

As you can see, there are duplicates in the results. Maybe I'm doing something wrong and not setting up the crawler properly, but I highly doubt this is intended behaviour. Anyway, I would appreciate any help.

hugokung commented 9 months ago

Because c.OnResponse is called 5 times in the loop, and each call appends its callback to c.responseCallbacks, five callbacks end up registered. When a goroutine completes its request, it executes every function in c.responseCallbacks, so each response body is printed once per registered callback.
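
To make the arithmetic concrete, here is a minimal stdlib-only sketch of the append behaviour described above. It is not colly's actual code: `miniCollector`, `handle`, and `countCalls` are hypothetical names made up for illustration, and the sketch assumes that all responses arrive after the loop has finished registering callbacks (plausible here, since the endpoint delays each response by 2 seconds).

```go
package main

import "fmt"

// miniCollector is a hypothetical stand-in for a collector. It models only
// the callback list, not colly's real implementation.
type miniCollector struct {
	responseCallbacks []func(body string)
}

// OnResponse appends the callback; earlier callbacks are never replaced.
func (c *miniCollector) OnResponse(f func(body string)) {
	c.responseCallbacks = append(c.responseCallbacks, f)
}

// handle simulates a completed request: every registered callback runs.
func (c *miniCollector) handle(body string) {
	for _, f := range c.responseCallbacks {
		f(body)
	}
}

// countCalls returns the total number of callback invocations for n requests.
// With inLoop set, one callback is registered per request (the pattern from
// the issue); otherwise a single callback is registered up front.
func countCalls(n int, inLoop bool) int {
	c := &miniCollector{}
	calls := 0
	cb := func(body string) { calls++ }

	if !inLoop {
		c.OnResponse(cb)
	}
	for i := 0; i < n; i++ {
		if inLoop {
			c.OnResponse(cb)
		}
	}

	// Assumption: responses complete only after all registrations are done.
	for i := 0; i < n; i++ {
		c.handle(fmt.Sprintf("n=%d", i))
	}
	return calls
}

func main() {
	fmt.Println(countCalls(5, true))  // 25: five callbacks run for each of five responses
	fmt.Println(countCalls(5, false)) // 5: one callback runs for each response
}
```

Applied to the code in the issue, this suggests moving the c.OnResponse call above the for loop so that it is registered only once, leaving just c.Visit inside the loop.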