What steps will reproduce the problem?
1. Crawl a page such as http://a.b/c/d/e.html whose HTML contains
<base href="http://a.b/c/">
2. Relative links in the page are extracted incorrectly, e.g. "../x.html" is
extracted as "http://a.b/c/x.html" (resolved against the page URL) instead of
"http://a.b/x.html" (resolved against the base href)
What is the expected output? What do you see instead?
Expected: relative links are resolved against the <base href>, so "../x.html"
is extracted as "http://a.b/x.html".
Actual: the <base href> is not applied, and "../x.html" is extracted as
"http://a.b/c/x.html".
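
To illustrate the expected behaviour, here is a minimal sketch that uses only
java.net.URI from the JDK. It is not the attached patch and not crawler4j code:
the class name BaseHrefDemo and the method resolve are hypothetical, and it
assumes the base href has already been read from the page and is an absolute
URL.

```java
import java.net.URI;

// Hypothetical helper, not part of crawler4j: shows how a relative link
// should be resolved when the page declares a <base href>.
class BaseHrefDemo {

    // Resolve 'link' against the base href if one was found in the page,
    // otherwise against the URL of the page itself.
    // Assumption: baseHref, when present, is an absolute URL.
    static String resolve(String pageUrl, String baseHref, String link) {
        String context = (baseHref == null || baseHref.isEmpty()) ? pageUrl : baseHref;
        return URI.create(context).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://a.b/c/d/e.html";

        // With <base href="http://a.b/c/">: prints http://a.b/x.html
        System.out.println(resolve(page, "http://a.b/c/", "../x.html"));

        // Without a base href (the behaviour reported): prints http://a.b/c/x.html
        System.out.println(resolve(page, null, "../x.html"));
    }
}
```

The second call reproduces the URL reported above, which suggests the link is
being resolved against the page URL rather than the base href.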
What version of the product are you using? On what operating system?
version 2.2 and latest build from SVN. Windows 7.
Please provide any additional information below.
The attached patch on /src/edu/uci/ics/crawler4j/crawler/HTMLParser.java may
help.
Original issue reported on code.google.com by hoiwai1...@gmail.com on 31 Dec 2010 at 2:42