mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Erroneous link URL extraction from HTML #12

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,
I found that the parser extracts links from an HTML page that are not correct: 
it omits part of the URL path, which results in a 404. This happens especially 
when the URL does not name a file (the server serves a default file) or when 
crawling a nested folder structure (served by a web server) that ultimately 
leads to files.

What steps will reproduce the problem?
1. Try to crawl URLs that do not name a file.
2. Try to crawl a nested folder structure served by a web server (up to 3/4 levels deep).

What is the expected output? What do you see instead?
- Extracted URLs are missing the last part of the path, resulting in a 404.
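The symptom described here is consistent with standard RFC 3986 relative-URL resolution when a directory-style base URL lacks a trailing slash: the last path segment of the base is dropped before the relative link is appended. The sketch below is not crawler4j's actual code, and the example.com paths are hypothetical; it only demonstrates the resolution behavior that could produce these 404s.

```java
import java.net.URI;

// Illustrative sketch of RFC 3986 relative-link resolution.
// The example.com URLs are hypothetical, not taken from the report.
public class RelativeLinkDemo {
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Base URL points at a nested folder but has no trailing slash:
        // the final segment ("sub") is dropped during resolution, which
        // would yield a URL that 404s.
        System.out.println(resolve("http://example.com/docs/sub", "page.html"));
        // -> http://example.com/docs/page.html

        // With the trailing slash, the nested folder is preserved.
        System.out.println(resolve("http://example.com/docs/sub/", "page.html"));
        // -> http://example.com/docs/sub/page.html
    }
}
```

If the crawler treats a default-file URL like `http://example.com/docs/sub` as the base without normalizing it to end in `/`, every relative link under that folder would resolve one level too high, matching the reported behavior.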

What version of the product are you using? On what operating system?
- v1.8 / Linux 2.6.9-11.ELsmp / jre-1.5.0

Please provide any additional information below.

Thanks.
praveen (pkalwar@gmail.com)

Original issue reported on code.google.com by pkal...@gmail.com on 6 Aug 2010 at 7:24

GoogleCodeExporter commented 9 years ago
Can you provide an example?

Original comment by ganjisaffar@gmail.com on 8 Aug 2010 at 1:29

GoogleCodeExporter commented 9 years ago
This should be fixed in the new version. If not, please provide me with examples.

-Yasser

Original comment by ganjisaffar@gmail.com on 11 Mar 2011 at 11:07