Open quangnx99 opened 10 months ago
When I scrape data, the page returns HTTP status 404 but the result still has an HTML response body. I want to get that response. But in colly, if OnError fires then OnHTML does not. How can I get the response when an error occurs?
I resolved this by setting the ParseHTTPErrorResponse property in OnRequest:
c.OnRequest(func(r *colly.Request) {
	c.ParseHTTPErrorResponse = true
})
I also have this issue: a website returns 410 Gone but still provides the HTML body, yet the request fails in colly. ParseHTTPErrorResponse does not seem to work, nor is it ideal, as I'd still like to error on other codes.
You can hack around this in the OnError callback, but honestly it's very gross because you're limited in how much you can hook into the colly logic (really you want to push onto the on http callback slice, but it's private).
I strongly suggest doing this outside of colly with a std http request + goquery instead of the below.
func (c *Client) GetPage(_ context.Context, id string) (*PageResult, error) {
	pageUrl := "http://google.com"
	col := colly.NewCollector()
	col.UserAgent = userAgent
	// Assumes PageResult.Model holds the script text as a string.
	var pageModel string
	// Set only inside the callback; named so that it is not shadowed there.
	var visitErr error
	col.OnError(func(res *colly.Response, collyErr error) {
		// Treat 410 Gone as parseable; every other error status stays an error.
		if res.StatusCode != http.StatusOK && res.StatusCode != http.StatusGone {
			visitErr = fmt.Errorf("invalid status code for page %s: %w", pageUrl, collyErr)
			return
		}
		doc, parseErr := goquery.NewDocumentFromReader(bytes.NewBuffer(res.Body))
		if parseErr != nil {
			visitErr = fmt.Errorf("could not parse response body: %w", parseErr)
			return
		}
		// Take the text of the first <script> element.
		pageModel = doc.Find("script").First().Text()
	})
	// Visit returns an error for the 410 as well, so its result is ignored
	// and visitErr, set in OnError, is checked instead.
	_ = col.Visit(pageUrl)
	if visitErr != nil {
		return nil, fmt.Errorf("could not visit %s: %w", pageUrl, visitErr)
	}
	return &PageResult{
		Model: pageModel,
	}, nil
}
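For comparison, here is a rough sketch of the "outside of colly" route suggested above, using only the standard library. The names fetchPage and newStubServer are made up for illustration, and the stub server just stands in for a site that answers 410 Gone with an HTML body; this is one possible shape under those assumptions, not a drop-in replacement.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchPage (hypothetical helper) GETs pageURL and returns the body and status code.
// Status codes listed in allowed are accepted; any other status is an error.
func fetchPage(pageURL string, allowed ...int) ([]byte, int, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, 0, fmt.Errorf("could not fetch %s: %w", pageURL, err)
	}
	defer resp.Body.Close()

	ok := false
	for _, code := range allowed {
		if resp.StatusCode == code {
			ok = true
			break
		}
	}
	if !ok {
		return nil, resp.StatusCode, fmt.Errorf("invalid status code %d for page %s", resp.StatusCode, pageURL)
	}

	// net/http does not treat 4xx/5xx as errors, so the body is still readable here.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, resp.StatusCode, fmt.Errorf("could not read response body: %w", err)
	}
	return body, resp.StatusCode, nil
}

// newStubServer (hypothetical) starts a local server that always answers
// with the given status and body, standing in for the real site.
func newStubServer(status int, body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
		fmt.Fprint(w, body)
	}))
}

func main() {
	srv := newStubServer(http.StatusGone, "<html><script>var model = 1</script></html>")
	defer srv.Close()

	// Accept 200 and 410; every other status is still reported as an error.
	body, status, err := fetchPage(srv.URL, http.StatusOK, http.StatusGone)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(status, len(body))
}
```

The returned body can then be handed to goquery.NewDocumentFromReader(bytes.NewReader(body)) and queried the same way as in the colly version, while the allowed list keeps the "still error on other codes" behaviour.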