crawler will not follow relative URLs in redirects

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Take the simple crawler example, remove all calls to controller.addSeed(), 
and replace them with this one:
controller.addSeed("http://dairymix.com/");
2. This URL redirects. The relevant response headers are below:

Server         Microsoft-IIS/6.0
X-Powered-By   ASP.NET
Location       website_import_001.htm

Note that the Location value is a relative URL.

What is the expected output? What do you see instead?
I see an exception.

java.lang.NullPointerException

    at edu.uci.ics.crawler4j.frontier.DocIDServer.getDocID(DocIDServer.java:70)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:143)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:108)
    at java.lang.Thread.run(Unknown Source)

Although relative URLs are technically not valid in the Location header, the 
Apache HttpClient library handles them correctly, so it would be reasonable to 
expect crawler4j to handle them as well.

What version of the product are you using? On what operating system?
SVN trunk rev.21

Please provide any additional information below.

Please add an extra config option to crawler4j.properties so that we can 
override the default behavior and let the HTTP client library handle the 
redirects, or update the WebCrawler class to handle relative URLs in this case.

Original issue reported on code.google.com by robertop...@gmail.com on 23 May 2011 at 10:23
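
For reference, the resolution the report asks for can be sketched with the JDK's java.net.URI, independent of crawler4j. This is a minimal illustration, not the fix that was eventually shipped. The class and method names (RelativeRedirectExample, resolveLocation) are hypothetical, and the inputs are the seed URL and Location value from the report.

```java
import java.net.URI;

// Hypothetical helper, not part of crawler4j.
class RelativeRedirectExample {

    // Resolves a Location header value against the URL that was requested.
    // Absolute values come back unchanged; relative ones are resolved
    // against the request URL.
    static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        // Seed URL and Location value taken from the report above.
        System.out.println(resolveLocation("http://dairymix.com/", "website_import_001.htm"));
        // prints http://dairymix.com/website_import_001.htm
    }
}
```

The resolved string is the kind of value that could be passed to page.getWebURL().setURL(...) instead of the raw header value.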

GoogleCodeExporter commented 8 years ago

I also see the same issue.

Original comment by jss.a...@gmail.com on 9 Jun 2011 at 7:58

GoogleCodeExporter commented 8 years ago

I added this code to PageFetcher, and it works with relative URLs in redirects.

if (statusCode == HttpStatus.SC_MOVED_PERMANENTLY
        || statusCode == HttpStatus.SC_MOVED_TEMPORARILY) {
    Header header = response.getFirstHeader("Location");
    if (header != null) {
        String movedToUrl = header.getValue();
        if (!movedToUrl.contains("http://")) {
            movedToUrl = get.getURI().getScheme() + "://" + get.getURI().getHost()
                    + movedToUrl;
        }
        page.getWebURL().setURL(movedToUrl);
    } else {
        page.getWebURL().setURL(null);
    }
    return PageFetchStatus.Moved;
}

Original comment by DLopezGo...@gmail.com on 15 Jun 2011 at 12:57

GoogleCodeExporter commented 8 years ago

I think it should be the following:

if(!movedToUrl.startsWith("http://") || !movedToUrl.startsWith("https://"))

Original comment by Sunshine...@sohu.com on 19 Aug 2011 at 4:00

GoogleCodeExporter commented 8 years ago

Hi,

In the last suggestion, get.getURI().getPath() is missing as a connector.
This patch should solve that.

Original comment by u...@taykey.com on 30 Aug 2011 at 3:55

Attachments:

PageFetcher.patch
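
The attached patch is not reproduced in this thread, so the following is only a sketch of one reading of the suggestion above (scheme + host + path + Location), compared with java.net.URI.resolve. The class name, method names, and the example.com URL are hypothetical.

```java
import java.net.URI;

// Hypothetical comparison; the attached PageFetcher.patch is not shown here.
class PathConnectorExample {

    // Assumed reading of the suggestion: scheme + host + path + Location.
    static String concatenate(URI requestUri, String location) {
        return requestUri.getScheme() + "://" + requestUri.getHost()
                + requestUri.getPath() + location;
    }

    // Reference resolution as implemented by java.net.URI.
    static String resolve(URI requestUri, String location) {
        return requestUri.resolve(location).toString();
    }

    public static void main(String[] args) {
        URI dir = URI.create("http://dairymix.com/");
        URI file = URI.create("http://example.com/a/b.html"); // hypothetical URL

        // Both approaches agree when the request path ends with "/".
        System.out.println(concatenate(dir, "website_import_001.htm"));
        System.out.println(resolve(dir, "website_import_001.htm"));

        // They differ when the request path ends with a file name.
        System.out.println(concatenate(file, "c.html")); // http://example.com/a/b.htmlc.html
        System.out.println(resolve(file, "c.html"));     // http://example.com/a/c.html
    }
}
```

String concatenation also drops any port and query string from the request URL, which is one reason resolving through URI may be the safer choice.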

GoogleCodeExporter commented 8 years ago

I think it should be the following instead:

if (!movedToUrl.startsWith("http://") && !movedToUrl.startsWith("https://"))

because a URL cannot start with both "http://" and "https://", so the 
condition written with || is always true.

Original comment by lance.ch...@gmail.com on 18 Sep 2011 at 2:55
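
The difference between the two conditions can be checked with a small standalone program. This is a sketch only; the class and method names are hypothetical, and the sample strings are taken from elsewhere in this thread.

```java
// Hypothetical demo class comparing the two conditions from the comments above.
class AbsoluteCheckExample {

    // Condition with ||: true for every input, because no string starts
    // with both prefixes at once.
    static boolean needsPrefixOr(String url) {
        return !url.startsWith("http://") || !url.startsWith("https://");
    }

    // Condition with &&: true only when neither prefix is present.
    static boolean needsPrefixAnd(String url) {
        return !url.startsWith("http://") && !url.startsWith("https://");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://dairymix.com/",
            "https://login.yahoo.com/config/login/photos/signup/",
            "website_import_001.htm"
        };
        for (String s : samples) {
            System.out.println(s + " or=" + needsPrefixOr(s) + " and=" + needsPrefixAnd(s));
        }
    }
}
```

With ||, all three samples are treated as relative. With &&, only website_import_001.htm is.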

GoogleCodeExporter commented 8 years ago

u...@taykey.com :

I tried your patch and can't make sense of the error I'm getting. There seems 
to be an extraneous 'else{}' in there, which I removed, but toFetchURL appears 
to be concatenating several different URLs into one, since I get this 
error message:

 INFO [Crawler 1] Failed: HTTP/1.1 502 Connection reset by peer, while fetching
http://www.flickr.com/signup/https://login.yahoo.com/config/login/photos/signup/
https://login.yahoo.com/config/login/photos/signup/https://login.yahoo.com/confi
g/login/photos/signup/https://login.yahoo.com/config/login/photos/signup/https:/
/login.yahoo.com/config/login/photos/signup/https://login.yahoo.com/config/login

etc.

Any thoughts? Sorry to bother you. It looks like something small, such as a 
missing space somewhere, but I inserted your patch at the proper place in 
PageFetcher. Perhaps you could shed some light. Thanks in advance.

Original comment by Geoffrey...@gmail.com on 17 Oct 2011 at 6:03
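
The patch is not shown in this thread, so the following is only a guess at how a URL like the one in the log could arise. If an already absolute https:// Location passes the "is relative" test (for example the contains("http://") check from the earlier comment), the request URL gets prepended to it, and this can repeat on every subsequent redirect. The class and method names are hypothetical, and the two URLs are fragments of the log message, assumed here to be the request URL and the Location value.

```java
import java.net.URI;

// Hypothetical reproduction, not the actual patch.
class RedirectConcatenationExample {

    // Prefixes the request URL whenever the Location lacks "http://",
    // which wrongly includes absolute https:// URLs.
    static String naivePrefix(String requestUrl, String location) {
        if (!location.contains("http://")) {
            return requestUrl + location;
        }
        return location;
    }

    public static void main(String[] args) {
        String request = "http://www.flickr.com/signup/";
        String location = "https://login.yahoo.com/config/login/photos/signup/";

        System.out.println(naivePrefix(request, location));
        // prints http://www.flickr.com/signup/https://login.yahoo.com/config/login/photos/signup/

        System.out.println(URI.create(request).resolve(location));
        // prints https://login.yahoo.com/config/login/photos/signup/
    }
}
```

URI.resolve returns an absolute Location unchanged, so it avoids this kind of concatenation regardless of scheme.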

GoogleCodeExporter commented 8 years ago

This issue is resolved in version 3.0.

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 5:31

Changed state: Fixed
Added labels: Type-Enhancement
Removed labels: Type-Defect
