ir2718 closed this issue 2 months ago
I checked the robots.txt file of the website:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://n1info.hr/sitemap.xml
It seems that the site allows any user agent. However, the colly framework adds a default User-Agent header if you don't specify one:
c.UserAgent = "colly - https://github.com/gocolly/colly"
if hdr == nil {
	hdr = http.Header{"User-Agent": []string{c.UserAgent}}
}
Currently, when the User-Agent is 'colly - https://github.com/gocolly/colly', the server responds with 403. If you set a different User-Agent header, the program runs fine:
package main

import (
	"fmt"
	"strings"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()
	// specify the user-agent header
	c.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"

	var articleText string

	// Retry on every error; note that there is no retry limit here.
	c.OnError(func(response *colly.Response, err error) {
		fmt.Println("Status code:", response.StatusCode, "retrying . . .")
		response.Request.Retry()
	})

	// Collect the title, paragraphs and subheadings of the article.
	c.OnHTML(
		".wrapper-singlepost",
		func(e *colly.HTMLElement) {
			e.ForEach("h1.entry-title, .entry-content p, .entry-content h2", func(_ int, child *colly.HTMLElement) {
				articleText += strings.TrimSpace(child.Text) + " "
			})
		},
	)

	if err := c.Visit("https://n1info.hr/magazin/priblizava-nam-se-vrazji-komet-ovu-pojavu-necemo-moci-vidjeti-u-sljedecih-nekoliko-desetljeca/"); err != nil {
		fmt.Println("Visit failed:", err)
	}
	fmt.Println(articleText)
}
Hope this helps.
First of all, a big thank you to the creators of the project.
Secondly, a disclaimer: I'm a newbie at Golang and using colly, so this might not actually be a bug.
I'm trying to scrape a news website in order to create a text summarization dataset. Currently, my codebase is pretty large, but I've managed to create a small reproducible example:
This code snippet is supposed to extract the article text into the articleText variable and print it, but the website returns a 403 Forbidden status code. What makes this weird is that when you request the same URL using http.Get, it returns a 200 OK status code and the article text is present in the response body:
I'm not sure whether this is a bug or whether I'm missing something that would make this the expected behaviour. Do you have any advice on solving this issue?