Closed tamoyal closed 6 years ago
@tamoyal I'm not sure if I understand the bug, but I think the url parsing you mentioned is identical to the browsers url parsing. Without the trailing slash the browser replaces the last part of the path and with slash it appends the relative urls.
Here's a small python webapp to test the issue:
from flask import Flask
app = Flask(__name__)
@app.route('/')
def index():
return '<a href="uno">uno</a><br /><a href="uno/">uno/</a>'
@app.route('/uno')
def uno():
return '<a href="dos">dos</a>'
@app.route('/uno/')
def uno2():
return '<a href="dos">dos</a>'
app.run()
The link on the first page (uno()
) resolved to /dos
and the second (uno2()
) resolved to /uno/dos
in the browser.
Chrome doesn't behave like this. Go here and hit view source. Then ctrl-f for href="what-to-do-in-milwaukee"
and click on what-to-do-in-milwaukee
which chrome shows as a link when you view source. You will see it goes to http://www.fourpointsmilwaukeenorth.com/make-a-green-choice
and not http://www.fourpointsmilwaukeenorth.com/rooms/make-a-green-choice
Colly produces the same output:
package main
import (
"log"
"github.com/gocolly/colly"
)
func main() {
c := colly.NewCollector()
c.OnHTML("a[href=what-to-do-in-milwaukee]", func(e *colly.HTMLElement) {
log.Println(e.Request.AbsoluteURL(e.Attr("href")))
})
c.Visit("http://www.fourpointsmilwaukeenorth.com/rooms")
}
output:
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
2018/04/08 20:02:45 http://www.fourpointsmilwaukeenorth.com/what-to-do-in-milwaukee
Weird. With this code:
package main
import (
"fmt"
"github.com/gocolly/colly"
)
func main() {
c := colly.NewCollector(
colly.MaxDepth(3),
colly.Async(true),
colly.AllowedDomains("www.fourpointsmilwaukeenorth.com"),
)
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
href := e.Attr("href")
if href == "milwaukee-restaurant" {
fmt.Println("From:", e.Request.URL.String())
fmt.Println(e.Attr("href"))
fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
}
c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
})
c.Visit("http://www.fourpointsmilwaukeenorth.com/")
c.Wait()
}
I get plenty of:
...
milwaukee-restaurant
http://www.fourpointsmilwaukeenorth.com/rooms/en/milwaukee-restaurant
...
Any idea how that could happen?
There is no en/milwaukee-restaurant
link on the linked page (http://www.fourpointsmilwaukeenorth.com/rooms) . So, probably another subpage of the site contains these buggy links.
@asciimoo I updated the code to print the source url of the links. See above ^^^
I get this output:
...
From: http://www.fourpointsmilwaukeenorth.com/en/rooms/accessible
milwaukee-restaurant
http://www.fourpointsmilwaukeenorth.com/en/rooms/milwaukee-restaurant
...
So on the "From" page there is an href with milwaukee-restaurant
that colly resolves to http://www.fourpointsmilwaukeenorth.com/en/rooms/milwaukee-restaurant
which is an invalid URL. Note that I am printing the href and printing the result of calling the AbsoluteURL
method on that href right below it. The pages do not have buggy links.
This output seems valid to me.
I think I found the problem: the page has a <base href="http://www.fourpointsmilwaukeenorth.com/" />
tag, so the URL should be resolved to http://www.fourpointsmilwaukeenorth.com/milwaukee-restaurant
and Colly doesn't handle <base />
tag currently.
base
tag...learn something new every day. So should we reopen this or want me to start a new issue (feature request) with this as an example?
@tamoyal hopefully 3439a67 fixes the bug. Can you confirm?
fixed. thank you @asciimoo
I'm relatively new to golang so while I realize the url parsing is done using golang's core libraries, I have still found an issue that may be valuable to solve in a crawling project.
If you look at this code (currently line 57 here):
It appears to be doing the right thing for handling a case we see all too often in html:
<a href="home">Home</a>
However, it actually only handles this properly (I use "properly" loosely) if
r.URL
does not have a trailing slash. For example:Produces
http://www.blah.com/dos
The problem when there is a trailing slash is:
Produces
http://www.blah.com/uno/dos
So if the page the crawler is currently pulling an
href
that is ...let's call it ...poorly formed? It will actually send the crawler to a bunch of bad urls. What do you think the right way to handle this is?