gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.2k stars, 1.76k forks

Getting final request when there is a page redirect #74

Closed: native-human closed this issue 6 years ago

native-human commented 6 years ago

When a page redirects, colly follows the redirect automatically. In the OnHTML callback I then get a Request object, but it appears to be the original Request and not the one issued after the redirect. Since I want to follow all the links on the HTML page, I use the Request object to build absolute URLs. That does not work as expected here, because the Request object still carries the pre-redirect URL. The example below illustrates the problem:

package main

import (
    "fmt"
    "net/http"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    go func() {
        http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            http.Redirect(w, r, "/r/", http.StatusSeeOther)

        }))
        http.Handle("/r/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintf(w, `<a href="test">test</a>`)
        }))
        http.ListenAndServe("127.0.0.1:9999", nil)
    }()
    time.Sleep(500 * time.Millisecond)
    c := colly.NewCollector()
    c.AllowedDomains = []string{"127.0.0.1:9999"}
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
    })
    c.Visit("http://127.0.0.1:9999/")
    c.Wait()
    time.Sleep(1000 * time.Hour)
}

The example prints "http://127.0.0.1:9999/test". However, when I open "http://127.0.0.1:9999/" in Firefox and click the link, I end up at "http://127.0.0.1:9999/r/test".

Is there a better way to mimic the behavior of the browser in this case?
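
To make the difference concrete, here is a minimal sketch using only the standard library's net/url package. It resolves the relative link "test" against both the original URL and the post-redirect URL. The helper name resolve is hypothetical and only for illustration; this is not colly's code, just the relative-reference resolution a browser would perform.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it resolves href against base the way
// a browser resolves a relative link against the URL of the current page.
func resolve(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	bases := []string{
		"http://127.0.0.1:9999/",   // URL of the original request
		"http://127.0.0.1:9999/r/", // URL after the redirect
	}
	for _, base := range bases {
		abs, err := resolve(base, "test")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s + test -> %s\n", base, abs)
	}
}
```

The first base yields "http://127.0.0.1:9999/test" and the second yields "http://127.0.0.1:9999/r/test", which matches the two results described above.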

asciimoo commented 6 years ago

@native-human thanks for the detailed report. Hopefully 37c1a91 fixes it. Could you confirm?

native-human commented 6 years ago

Thanks, works great now for me!
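
For readers who want to see where the final URL lives outside colly, the following is a small sketch with plain net/http. In the standard library, resp.Request is the last request sent, so after a redirect its URL is the post-redirect one. The helpers newRedirectServer and finalPath are hypothetical names, the test server mirrors the two handlers from the report, and nothing here is meant to describe what commit 37c1a91 actually changes.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newRedirectServer (hypothetical helper) starts a local test server with
// the same two routes as the report: "/" redirects to "/r/".
func newRedirectServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/r/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a href="test">test</a>`)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/r/", http.StatusSeeOther)
	})
	return httptest.NewServer(mux)
}

// finalPath (hypothetical helper) fetches rawURL, lets net/http follow any
// redirects, and returns the path of the request that produced the response.
func finalPath(rawURL string) (string, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	// resp.Request is the final request in the redirect chain.
	return resp.Request.URL.Path, nil
}

func main() {
	srv := newRedirectServer()
	defer srv.Close()

	path, err := finalPath(srv.URL + "/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("final path:", path)
}
```

Resolving relative links against that final URL, instead of the originally requested one, reproduces the browser behavior described in the report.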