When crawling with depth 0 (i.e. only the provided seeds should be crawled),
crawler4j fetches the robots.txt files of all hosts of the links found in the seed pages.
For example, when crawling "http://code.google.com/p/crawler4j" with depth 0,
"http://code.google.com/robots.txt" should be the only robots.txt file
fetched. However, crawler4j also fetches "http://www.apache.org/robots.txt" (and
many more), which is unnecessary.
I attached a patch that should fix this issue.
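The attached patch is not reproduced here; the sketch below only illustrates the kind of guard that would avoid the extra fetches, using hypothetical names (RobotsFetchGuard, shouldFetchRobotsTxt) rather than crawler4j's actual classes: robots.txt for a host is only fetched if a link on that host is within the configured maximum depth and therefore will actually be crawled.

```java
// Hypothetical sketch (not the attached patch, not crawler4j's real API):
// decide per discovered link whether its host's robots.txt needs fetching.
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

public class RobotsFetchGuard {

    private final int maxDepth;                        // e.g. 0 = crawl seeds only
    private final Set<String> fetchedHosts = new HashSet<>();

    public RobotsFetchGuard(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /**
     * Returns true only if the link at the given depth will actually be
     * crawled and its host's robots.txt has not been fetched yet.
     */
    public boolean shouldFetchRobotsTxt(String url, int linkDepth) throws URISyntaxException {
        if (maxDepth >= 0 && linkDepth > maxDepth) {
            return false;                              // link is out of scope: skip robots.txt
        }
        String host = new URI(url).getHost();
        return host != null && fetchedHosts.add(host); // fetch at most once per host
    }

    public static void main(String[] args) throws Exception {
        RobotsFetchGuard guard = new RobotsFetchGuard(0);
        // Seed (depth 0): robots.txt of code.google.com is fetched.
        System.out.println(guard.shouldFetchRobotsTxt("http://code.google.com/p/crawler4j", 0)); // true
        // Outlink found in the seed page (depth 1): skipped when maxDepth is 0.
        System.out.println(guard.shouldFetchRobotsTxt("http://www.apache.org/", 1));             // false
    }
}
```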
Original issue reported on code.google.com by alexande...@unister-gmbh.de on 18 Aug 2011 at 11:49