jungjonghun / crawler4j

Automatically exported from code.google.com/p/crawler4j

crawler4j ignores robots.txt #128

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Put a robots.txt at http://localhost/robots.txt with these lines (a sketch of a minimal local setup follows after these steps):
User-agent: *
Disallow: /
2. Crawl some page on localhost with crawler4j.
3. You will get the contents of the page you crawled, even though robots.txt disallows it.
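
For reference, a minimal local setup for step 1 could look like the sketch below, using the JDK's built-in com.sun.net.httpserver.HttpServer; the port 8080 (so the URLs become http://localhost:8080/...) and the class name are assumptions for illustration, not part of the original report.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class RobotsTxtTestServer {
    public static void main(String[] args) throws Exception {
        byte[] robots = "User-agent: *\nDisallow: /\n".getBytes(StandardCharsets.UTF_8);
        byte[] page = "<html><body>should not be fetched</body></html>".getBytes(StandardCharsets.UTF_8);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Serve robots.txt; the Content-Type matters, as the follow-up comments show.
        server.createContext("/robots.txt", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, robots.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(robots);
            }
        });

        // Serve a page that the robots.txt above disallows.
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "text/html");
            exchange.sendResponseHeaders(200, page.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(page);
            }
        });

        server.start();
    }
}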

What is the expected output? What do you see instead?
crawler4j should honor robots.txt and skip the disallowed page; instead, the page is fetched and its contents are returned.

What version of the product are you using?
3.3

Please provide any additional information below.
This is my robotstxt configuration:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(true);
robotstxtConfig.setUserAgentName(userAgent);
robotstxtConfig.setCacheSize(0);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
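
For context, a minimal sketch of how a RobotstxtServer configured like this is typically wired into a crawl with the crawler4j 3.x API; the storage folder, seed URL, crawler class MyCrawler, and thread count are assumptions for illustration.

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl-storage"); // assumed storage folder

PageFetcher pageFetcher = new PageFetcher(config);

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(true);
robotstxtConfig.setUserAgentName(config.getUserAgentString());
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

// The controller and its crawlers use robotstxtServer to decide whether a URL may be fetched.
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/");              // assumed seed
controller.start(MyCrawler.class, 1);                 // MyCrawler: a hypothetical WebCrawler subclass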

Original issue reported on code.google.com by pikote...@gmail.com on 25 Feb 2012 at 8:08

GoogleCodeExporter commented 9 years ago
For your information, Heritrix has a Robotstxt class:
heritrix-3.1.0-src\heritrix-3.1.0\modules\src\main\java\org\archive\modules\net\Robotstxt.java
For example, it supports the "Crawl-delay" directive, which crawler4j does not (a rough sketch of what parsing it could look like follows below).
It is under the Apache License 2.0, so you could probably bring it into crawler4j.
(Please verify the licensing yourself if you do.)
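
To make the suggestion concrete, here is a rough, hypothetical sketch of extracting a Crawl-delay value from a robots.txt body. This is not part of crawler4j's API; the method name and the simplified handling of user-agent groups are assumptions.

// Hypothetical helper: returns the Crawl-delay (in seconds) that applies to the
// given user agent, or null if none is present.
// Simplification: each User-agent line starts a new group; real parsers allow
// several consecutive User-agent lines to share one group of rules.
static Double parseCrawlDelay(String robotsTxt, String userAgent) {
    boolean groupMatches = false;
    for (String rawLine : robotsTxt.split("\r?\n")) {
        String line = rawLine;
        int hash = line.indexOf('#'); // strip comments
        if (hash >= 0) {
            line = line.substring(0, hash);
        }
        line = line.trim();
        if (line.isEmpty()) {
            continue;
        }
        String lower = line.toLowerCase();
        if (lower.startsWith("user-agent:")) {
            String agent = line.substring("user-agent:".length()).trim().toLowerCase();
            groupMatches = agent.equals("*") || userAgent.toLowerCase().contains(agent);
        } else if (groupMatches && lower.startsWith("crawl-delay:")) {
            try {
                return Double.parseDouble(line.substring("crawl-delay:".length()).trim());
            } catch (NumberFormatException e) {
                return null; // malformed value; ignore the directive
            }
        }
    }
    return null;
}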

Original comment by pikote...@gmail.com on 25 Feb 2012 at 9:32

GoogleCodeExporter commented 9 years ago
It was my mistake, I'm sorry.
The MIME type of my robots.txt was "text/html", not "text/plain".
robots.txt should be served as text/plain.

But it might still be good to consider having crawler4j accept other MIME types, especially "text/*":
http://www.nextthing.org/archives/2007/03/12/robotstxt-adventure
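
As an illustration of that suggestion (not existing crawler4j behavior), a lenient content-type check could look roughly like this; the helper name and the lenient flag are assumptions.

// Hypothetical check: decide whether a fetched robots.txt response should be
// parsed, based on its Content-Type header.
static boolean isAcceptableRobotsContentType(String contentType, boolean lenient) {
    if (contentType == null) {
        // Some servers omit the header; treat the body as plain text.
        return true;
    }
    String normalized = contentType.toLowerCase();
    // Match "text/plain" as well as "text/plain; charset=UTF-8".
    if (normalized.startsWith("text/plain")) {
        return true;
    }
    // Lenient mode: accept any text/* type, since many sites serve robots.txt
    // with a misconfigured type such as text/html (see the linked article).
    return lenient && normalized.startsWith("text/");
}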

Original comment by pikote...@gmail.com on 26 Feb 2012 at 3:32

GoogleCodeExporter commented 9 years ago
As you mentioned, it's by design.

-Yasser

Original comment by ganjisaffar@gmail.com on 28 Feb 2012 at 5:47