gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

[Possible bug] colly returns 403, but http.Get returns 200 #824

Closed ir2718 closed 2 months ago

ir2718 commented 3 months ago

First of all, a big thank you to the creators of the project.

Secondly, a disclaimer: I'm a newbie at Golang and at using colly, so this might not actually be a bug.


I'm trying to scrape a news website in order to create a text summarization dataset. My codebase is currently pretty large, but I've managed to create a small reproducible example:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    var articleText string

    c.OnError(func(response *colly.Response, err error) {
        fmt.Println("Status code:", response.StatusCode, "retrying . . .")
        response.Request.Retry()
    })

    c.OnHTML(
        ".wrapper-singlepost",
        func(e *colly.HTMLElement) {
            e.ForEach("h1.entry-title, .entry-content p, .entry-content h2", func(_ int, child *colly.HTMLElement) {
                articleText += strings.TrimSpace(child.Text) + " "
            })
        },
    )
    c.Visit("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/")
    fmt.Println(articleText)
}

This code snippet is supposed to extract the article text into the articleText variable and print it, but the website returns a 403 Forbidden status code. What makes this weird is that when you request the same URL with http.Get, it returns a 200 OK status code and the article text is present in the response body:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    res, err := http.Get("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    fmt.Println(res.StatusCode)

    body, err := io.ReadAll(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Response Body:")
    fmt.Println(string(body))
}

I'm not sure whether this is a bug or whether I'm missing something that would make this the expected behaviour. Do you have any advice on solving this issue?

MrMalleable commented 2 months ago

I checked the robots.txt file of the website:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://n1info.hr/sitemap.xml

It seems that the site allows any User-Agent header, but the colly framework adds a default User-Agent header if you don't specify one:

c.UserAgent = "colly - https://github.com/gocolly/colly"
if hdr == nil {
    hdr = http.Header{"User-Agent": []string{c.UserAgent}}
}
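
To confirm that the block is tied to the User-Agent string rather than to colly itself, you can reproduce the 403 with plain net/http by sending colly's default header. This is just a diagnostic sketch using the URL from the original report; that the server filters on User-Agent is an assumption based on the behaviour described above:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    url := "https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/"

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    // Send colly's default User-Agent; the expectation is a 403 here,
    // while Go's default "Go-http-client/1.1" (what http.Get sends) gets a 200.
    req.Header.Set("User-Agent", "colly - https://github.com/gocolly/colly")

    res, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    fmt.Println("Status code:", res.StatusCode)
}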

The current situation is that when the User-Agent is 'colly - https://github.com/gocolly/colly', the server responds with 403. So if you specify a different User-Agent header, the program runs fine:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    // specify the user-agent header
    c.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"
    var articleText string

    c.OnError(func(response *colly.Response, err error) {
        fmt.Println("Status code:", response.StatusCode, "retrying . . .")
        response.Request.Retry()
    })

    c.OnHTML(
        ".wrapper-singlepost",
        func(e *colly.HTMLElement) {
            e.ForEach("h1.entry-title, .entry-content p, .entry-content h2", func(_ int, child *colly.HTMLElement) {
                articleText += strings.TrimSpace(child.Text) + " "
            })
        },
    )
    c.Visit("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/")
    fmt.Println(articleText)
}
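
For reference, the same thing can be done by passing the User-Agent as a collector option when creating the collector, which keeps the configuration in one place. A minimal equivalent sketch:

c := colly.NewCollector(
    colly.UserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"),
)

There is also a RandomUserAgent helper in the github.com/gocolly/colly/extensions package if you'd rather rotate browser strings between requests.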

Hope it helps.
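
One more note on the snippet from the original report: since OnError calls Retry() unconditionally, a URL that keeps returning 403 would be retried forever. A minimal sketch of capping the retries with a simple counter in the enclosing scope (the limit of 3 is an arbitrary choice, and a plain counter only works here because the example visits a single URL):

maxRetries := 3
retries := 0
c.OnError(func(response *colly.Response, err error) {
    if retries < maxRetries {
        retries++
        fmt.Println("Status code:", response.StatusCode, "retrying . . .")
        response.Request.Retry()
        return
    }
    fmt.Println("giving up after", maxRetries, "retries:", err)
})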