gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.3k stars, 1.76k forks

How to bypass c.OnError #799

Open: quangnx99 opened this issue 10 months ago

quangnx99 commented 10 months ago

When I scrape data, the page returns HTTP status 404, but the result still has an HTML response body. I want to get that response. But in colly, if OnError is triggered, OnHTML does not run. How can I get the response when there is an error?

quangnx99 commented 10 months ago

I resolved this by setting the ParseHTTPErrorResponse property in OnRequest:

    c.OnRequest(func(r *colly.Request) {
        c.ParseHTTPErrorResponse = true
    })

oliverbenns commented 5 months ago

I also have this issue: a website returns 410 Gone but still provides the HTML body, yet the request fails in colly. ParseHTTPErrorResponse does not seem to work, nor is it ideal, as I'd still like to error on other codes.

oliverbenns commented 5 months ago

You can hack around this inside the OnError callback, but honestly it's very gross because you're limited in how much you can hook into the Colly logic (really you want to push onto the OnHTML callback slice, but it's private).

I strongly suggest doing this outside of colly with a std http request + goquery instead of the below.

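// Needs: bytes, context, fmt, net/http, github.com/PuerkitoBio/goquery,
// and github.com/gocolly/colly/v2 (or github.com/gocolly/colly for v1).
func (c *Client) GetPage(_ context.Context, id string) (*PageResult, error) {
    pageURL := "http://google.com"
    col := colly.NewCollector()
    col.UserAgent = userAgent

    var pageModel *PageModel
    var err error

    // OnError fires when the request fails, including HTTP error statuses;
    // res.Body still holds whatever HTML the server sent.
    col.OnError(func(res *colly.Response, collyErr error) {
        if res.StatusCode != http.StatusOK && res.StatusCode != http.StatusGone {
            // Wrap collyErr here; the outer err is still nil at this point.
            err = fmt.Errorf("invalid status code %d for page %s: %w", res.StatusCode, pageURL, collyErr)
            return
        }

        // Use a separate name so := does not shadow the outer err.
        doc, parseErr := goquery.NewDocumentFromReader(bytes.NewReader(res.Body))
        if parseErr != nil {
            err = fmt.Errorf("could not parse response body: %w", parseErr)
            return
        }

        // Keep the text of the first <script> element only.
        doc.Find("script").Each(func(i int, s *goquery.Selection) {
            if i == 0 {
                // Assumes PageModel is a string-based type; adapt to the real model.
                model := PageModel(s.Text())
                pageModel = &model
            }
        })
    })

    // Visit's return value is ignored because OnError records the failure in err.
    _ = col.Visit(pageURL)
    if err != nil {
        return nil, fmt.Errorf("could not visit %s: %w", pageURL, err)
    }

    return &PageResult{
        Model: pageModel,
    }, nil
}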