gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.2k stars 1.76k forks source link

Issue parsing relative urls #125

Closed tamoyal closed 6 years ago

tamoyal commented 6 years ago

I'm relatively new to golang so while I realize the url parsing is done using golang's core libraries, I have still found an issue that may be valuable to solve in a crawling project.

If you look at this code (currently line 57 here):

func (r *Request) AbsoluteURL(u string) string {
    if strings.HasPrefix(u, "#") {
        return ""
    }
    absURL, err := r.URL.Parse(u)
    if err != nil {
        return ""
    }
    absURL.Fragment = ""
    if absURL.Scheme == "//" {
        absURL.Scheme = r.URL.Scheme
    }
    return absURL.String()
}

It appears to be doing the right thing for handling a case we see all too often in html: <a href="home">Home</a>

However, it actually only handles this properly (I use "properly" loosely) if r.URL does not have a trailing slash. For example:

u, err := url.Parse("http://www.blah.com/uno")
next, err := u.Parse("dos")
next.String()

Produces http://www.blah.com/dos

The problem when there is a trailing slash is:

u, err := url.Parse("http://www.blah.com/uno/")
next, err := u.Parse("dos")
next.String()

Produces http://www.blah.com/uno/dos

So if the page the crawler is currently pulling an href that is ...let's call it ...poorly formed? It will actually send the crawler to a bunch of bad urls. What do you think the right way to handle this is?

asciimoo commented 6 years ago

@tamoyal I'm not sure if I understand the bug, but I think the url parsing you mentioned is identical to the browsers url parsing. Without the trailing slash the browser replaces the last part of the path and with slash it appends the relative urls.

Here's a small python webapp to test the issue:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return '<a href="uno">uno</a><br /><a href="uno/">uno/</a>'

@app.route('/uno')
def uno():
    return '<a href="dos">dos</a>'

@app.route('/uno/')
def uno2():
    return '<a href="dos">dos</a>'

app.run()

The link on the first page (uno()) resolved to /dos and the second (uno2()) resolved to /uno/dos in the browser.

tamoyal commented 6 years ago

Chrome doesn't behave like this. Go here and hit view source. Then ctrl-f for href="what-to-do-in-milwaukee" and click on what-to-do-in-milwaukee which chrome shows as a link when you view source. You will see it goes to http://www.fourpointsmilwaukeenorth.com/make-a-green-choice and not http://www.fourpointsmilwaukeenorth.com/rooms/make-a-green-choice

asciimoo commented 6 years ago

Colly produces the same output:

package main

import (
    "log"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    c.OnHTML("a[href=what-to-do-in-milwaukee]", func(e *colly.HTMLElement) {
        log.Println(e.Request.AbsoluteURL(e.Attr("href")))
    })

    c.Visit("http://www.fourpointsmilwaukeenorth.com/rooms")
}

output:

2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
tamoyal commented 6 years ago

Weird. With this code:

package main

import (
  "fmt"

  "github.com/gocolly/colly"
)

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(3),
    colly.Async(true),
    colly.AllowedDomains("www.fourpointsmilwaukeenorth.com"),
  )

  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "milwaukee-restaurant" {
      fmt.Println("From:", e.Request.URL.String())
      fmt.Println(e.Attr("href"))
      fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
    }
    c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
  })

  c.Visit("http://www.fourpointsmilwaukeenorth.com/")

  c.Wait()
}

I get plenty of:

...
milwaukee-restaurant
http://www.fourpointsmilwaukeenorth.com/rooms/en/milwaukee-restaurant
...

Any idea how that could happen?

asciimoo commented 6 years ago

There is no en/milwaukee-restaurant link on the linked page (http://www.fourpointsmilwaukeenorth.com/rooms) . So, probably another subpage of the site contains these buggy links.

tamoyal commented 6 years ago

@asciimoo I updated the code to print the source url of the links. See above ^^^

I get this output:

...
From: http://www.fourpointsmilwaukeenorth.com/en/rooms/accessible
milwaukee-restaurant
http://www.fourpointsmilwaukeenorth.com/en/rooms/milwaukee-restaurant
...

So on the "From" page there is an href with milwaukee-restaurant that colly resolves to http://www.fourpointsmilwaukeenorth.com/en/rooms/milwaukee-restaurant which is an invalid URL. Note that I am printing the href and printing the result of calling the AbsoluteURL method on that href right below it. The pages do not have buggy links.

asciimoo commented 6 years ago

This output seems valid to me. I think I found the problem: the page has a <base href="http://www.fourpointsmilwaukeenorth.com/" /> tag, so the URL should be resolved to http://www.fourpointsmilwaukeenorth.com/milwaukee-restaurant and Colly doesn't handle <base /> tag currently.

tamoyal commented 6 years ago

base tag...learn something new every day. So should we reopen this or want me to start a new issue (feature request) with this as an example?

asciimoo commented 6 years ago

@tamoyal hopefully 3439a67 fixes the bug. Can you confirm?

tamoyal commented 6 years ago

fixed. thank you @asciimoo