When crawling with depth 0 (i.e. only the provided seeds should be crawled),
crawler4j fetches the robots.txt files of all hosts of the links found in the seed pages.
For example, when crawling "http://code.google.com/p/crawler4j" with depth 0,
"http://code.google.com/robots.txt" should be the only robots.txt file
fetched. However, crawler4j also fetches "http://www.apache.org/robots.txt" (and
many more), which is unnecessary.
I attached a patch that should fix this issue.
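The attached patch is not reproduced here; the sketch below only illustrates the kind of guard that would avoid the extra fetches, using hypothetical names (RobotsFetchGuard, shouldFetchRobotsTxt) rather than crawler4j's actual classes: robots.txt for a host is only fetched if a link on that host is within the configured maximum depth and therefore will actually be crawled.

```java
// Hypothetical sketch (not the attached patch, not crawler4j's real API):
// decide per discovered link whether its host's robots.txt needs fetching.
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

public class RobotsFetchGuard {

    private final int maxDepth;                        // e.g. 0 = crawl seeds only
    private final Set<String> fetchedHosts = new HashSet<>();

    public RobotsFetchGuard(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /**
     * Returns true only if the link at the given depth will actually be
     * crawled and its host's robots.txt has not been fetched yet.
     */
    public boolean shouldFetchRobotsTxt(String url, int linkDepth) throws URISyntaxException {
        if (maxDepth >= 0 && linkDepth > maxDepth) {
            return false;                              // link is out of scope: skip robots.txt
        }
        String host = new URI(url).getHost();
        return host != null && fetchedHosts.add(host); // fetch at most once per host
    }

    public static void main(String[] args) throws Exception {
        RobotsFetchGuard guard = new RobotsFetchGuard(0);
        // Seed (depth 0): robots.txt of code.google.com is fetched.
        System.out.println(guard.shouldFetchRobotsTxt("http://code.google.com/p/crawler4j", 0)); // true
        // Outlink found in the seed page (depth 1): skipped when maxDepth is 0.
        System.out.println(guard.shouldFetchRobotsTxt("http://www.apache.org/", 1));             // false
    }
}
```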
Original issue reported on code.google.com by alexande...@unister-gmbh.de on 18 Aug 2011 at 11:49