gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

[Possible bug] colly returns 403, but http.Get returns 200 #824

Closed ir2718 closed 2 months ago

ir2718 commented 3 months ago

First of all, a big thank you to the creators of the project.

Secondly, a disclaimer: I'm a newbie at Golang and at using colly, so this might not actually be a bug.


I'm trying to scrape a news website in order to create a text summarization dataset. My codebase is currently pretty large, but I've managed to create a small reproducible example:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    var articleText string

    c.OnError(func(response *colly.Response, err error) {
        fmt.Println("Status code:", response.StatusCode, "retrying . . .")
        response.Request.Retry()
    })

    c.OnHTML(
        ".wrapper-singlepost",
        func(e *colly.HTMLElement) {
            e.ForEach("h1.entry-title, .entry-content p, .entry-content h2", func(_ int, child *colly.HTMLElement) {
                articleText += strings.TrimSpace(child.Text) + " "
            })
        },
    )
    c.Visit("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/")
    fmt.Println(articleText)
}

This code snippet is supposed to extract the article text into the articleText variable and print it, but the website returns a 403 Forbidden status code. What makes this weird is that when you request the same URL with http.Get, it returns a 200 OK status code and the article text is present in the response body:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    res, err := http.Get("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    fmt.Println(res.StatusCode)

    body, err := io.ReadAll(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Response Body:")
    fmt.Println(string(body))
}

I'm not sure whether this is a bug or whether I'm missing something that would make this the expected behaviour. Do you have any advice on solving this issue?

MrMalleable commented 2 months ago

I checked the robots.txt file of the website:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://n1info.hr/sitemap.xml

It seems that the site allows any User-Agent header, but the colly framework adds a default User-Agent header if you don't specify one:

c.UserAgent = "colly - https://github.com/gocolly/colly"
if hdr == nil {
    hdr = http.Header{"User-Agent": []string{c.UserAgent}}
}
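
To confirm that the block is tied to the User-Agent string rather than to colly itself, you can reproduce the 403 with plain net/http by sending colly's default header. This is just a diagnostic sketch using the URL from the original report; that the server filters on User-Agent is an assumption based on the behaviour described above:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    url := "https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/"

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    // Send colly's default User-Agent; the expectation is a 403 here,
    // while Go's default "Go-http-client/1.1" (what http.Get sends) gets a 200.
    req.Header.Set("User-Agent", "colly - https://github.com/gocolly/colly")

    res, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    fmt.Println("Status code:", res.StatusCode)
}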

The current situation is that when the User-Agent is 'colly - https://github.com/gocolly/colly', the server responds with 403. So if you specify a different User-Agent header, the program runs fine:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    // specify the user-agent header
    c.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"
    var articleText string

    c.OnError(func(response *colly.Response, err error) {
        fmt.Println("Status code:", response.StatusCode, "retrying . . .")
        response.Request.Retry()
    })

    c.OnHTML(
        ".wrapper-singlepost",
        func(e *colly.HTMLElement) {
            e.ForEach("h1.entry-title, .entry-content p, .entry-content h2", func(_ int, child *colly.HTMLElement) {
                articleText += strings.TrimSpace(child.Text) + " "
            })
        },
    )
    c.Visit("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/")
    fmt.Println(articleText)
}
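
For reference, the same thing can be done by passing the User-Agent as a collector option when creating the collector, which keeps the configuration in one place. A minimal equivalent sketch:

c := colly.NewCollector(
    colly.UserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"),
)

There is also a RandomUserAgent helper in the github.com/gocolly/colly/extensions package if you'd rather rotate browser strings between requests.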

Hope it helps.
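
One more note on the snippet from the original report: since OnError calls Retry() unconditionally, a URL that keeps returning 403 would be retried forever. A minimal sketch of capping the retries with a simple counter in the enclosing scope (the limit of 3 is an arbitrary choice, and a plain counter only works here because the example visits a single URL):

maxRetries := 3
retries := 0
c.OnError(func(response *colly.Response, err error) {
    if retries < maxRetries {
        retries++
        fmt.Println("Status code:", response.StatusCode, "retrying . . .")
        response.Request.Retry()
        return
    }
    fmt.Println("giving up after", maxRetries, "retries:", err)
})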