gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.31k stars 1.77k forks

How do you bypass cookie consent? #585

Closed: DoYouKnowTheAnswer closed this issue 3 years ago

DoYouKnowTheAnswer commented 3 years ago

The website I'm attempting to scrape has a pop-up modal on the initial page load. It requires you to click the "Accept All Cookies" button before allowing access to the page.

How do I bypass this?

Current code:

    c := colly.NewCollector(colly.MaxBodySize(math.MaxInt32))
    timeout := 120 * time.Second
    c.SetRequestTimeout(timeout)

    c.OnHTML("tbody", func(e *colly.HTMLElement) {
        log.Println("Scraping ASX page...")
        e.ForEach("tr", func(_ int, elem *colly.HTMLElement) {
            log.Println(elem.ChildAttr("a:first-child", "href"))
        })
    })

    c.Visit("https://www2.asx.com.au/markets/trade-our-cash-market/announcements.apt")

The OnHTML callback never runs, apparently because either a consent cookie has to be sent with the request or that button has to be clicked. How do I resolve this?

sonu27 commented 3 years ago

This is probably out of scope for Colly. I'd suggest using a headless web browser instead.

WGH- commented 3 years ago

Most cookie consent dialogs are implemented with JavaScript, and thus are invisible to simple crawlers.

The real problem with this particular page is not the cookie consent modal. The problem is that the data is loaded with Ajax, so the table rows are not in the HTML that the crawler receives. Check your browser's network tab; you likely want to request the JSON-serving endpoint directly instead.