mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Erroneous link URL extraction if the HTML contains <base href="..."> #24

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl a page at http://a.b/c/d/e.html whose HTML contains <base href="http://a.b/c/">.
2. Any relative link in the page is extracted wrongly, e.g. "../x.html" is extracted as "http://a.b/c/x.html" instead of "http://a.b/x.html".

What is the expected output? What do you see instead?
Expected: "../x.html" is extracted as "http://a.b/x.html", i.e. resolved against the base href.
Instead: it is extracted as "http://a.b/c/x.html".

What version of the product are you using? On what operating system?
version 2.2 and latest build from SVN. Windows 7.

Please provide any additional information below.
The attached patch for /src/edu/uci/ics/crawler4j/crawler/HTMLParser.java may help.
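
For illustration only, here is a minimal sketch of the resolution rule described above. It is not the attached patch and not crawler4j's actual code; the class name BaseHrefResolver and its resolve method are hypothetical, and it assumes the base href has already been pulled out of the HTML. It uses only java.net.URI from the standard library.

```java
import java.net.URI;

// Hypothetical helper, not part of crawler4j.
final class BaseHrefResolver {

    /**
     * Resolves a link found in a page.
     * If the page declares a base href, that value (itself resolved against
     * the page URL, since it may be relative) becomes the base for all links.
     * Otherwise the page URL is the base.
     */
    static String resolve(String pageUrl, String baseHref, String link) {
        URI base = URI.create(pageUrl);
        if (baseHref != null && !baseHref.isEmpty()) {
            base = base.resolve(baseHref);
        }
        // normalize() removes any leftover "." and ".." path segments
        return base.resolve(link).normalize().toString();
    }

    public static void main(String[] args) {
        String page = "http://a.b/c/d/e.html";
        // With <base href="http://a.b/c/"> the link climbs out of /c/
        System.out.println(resolve(page, "http://a.b/c/", "../x.html")); // http://a.b/x.html
        // Without a base href the link is relative to /c/d/
        System.out.println(resolve(page, null, "../x.html"));            // http://a.b/c/x.html
    }
}
```

The second call reproduces the wrong result from step 2, which is what you get when the base href is ignored.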

Original issue reported on code.google.com by hoiwai1...@gmail.com on 31 Dec 2010 at 2:42

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for providing the patch. I added it to the latest version.

-Yasser

Original comment by ganjisaffar@gmail.com on 11 Mar 2011 at 11:04