jungjonghun / crawler4j

Automatically exported from code.google.com/p/crawler4j

crawler4j ignores robots.txt #128

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Put a robots.txt at http://localhost/robots.txt with these lines (a sketch of a minimal local setup follows after these steps):
User-agent: *
Disallow: /
2. Crawl some page on localhost with crawler4j.
3. You will get the contents of the page you crawled, even though robots.txt disallows it.
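
For reference, a minimal local setup for step 1 could look like the sketch below, using the JDK's built-in com.sun.net.httpserver.HttpServer; the port 8080 (so the URLs become http://localhost:8080/...) and the class name are assumptions for illustration, not part of the original report.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class RobotsTxtTestServer {
    public static void main(String[] args) throws Exception {
        byte[] robots = "User-agent: *\nDisallow: /\n".getBytes(StandardCharsets.UTF_8);
        byte[] page = "<html><body>should not be fetched</body></html>".getBytes(StandardCharsets.UTF_8);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Serve robots.txt; the Content-Type matters, as the follow-up comments show.
        server.createContext("/robots.txt", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, robots.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(robots);
            }
        });

        // Serve a page that the robots.txt above disallows.
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "text/html");
            exchange.sendResponseHeaders(200, page.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(page);
            }
        });

        server.start();
    }
}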

What is the expected output? What do you see instead?
crawler4j should honor robots.txt and skip the disallowed page; instead, the page is fetched and its contents are returned.

What version of the product are you using?
3.3

Please provide any additional information below.
This is my robotstxt configuration:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(true);
robotstxtConfig.setUserAgentName(userAgent);
robotstxtConfig.setCacheSize(0);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
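
For context, a minimal sketch of how a RobotstxtServer configured like this is typically wired into a crawl with the crawler4j 3.x API; the storage folder, seed URL, crawler class MyCrawler, and thread count are assumptions for illustration.

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl-storage"); // assumed storage folder

PageFetcher pageFetcher = new PageFetcher(config);

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(true);
robotstxtConfig.setUserAgentName(config.getUserAgentString());
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

// The controller and its crawlers use robotstxtServer to decide whether a URL may be fetched.
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/");              // assumed seed
controller.start(MyCrawler.class, 1);                 // MyCrawler: a hypothetical WebCrawler subclass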

Original issue reported on code.google.com by pikote...@gmail.com on 25 Feb 2012 at 8:08

GoogleCodeExporter commented 9 years ago
For your information, Heritrix has a Robotstxt class:
heritrix-3.1.0-src\heritrix-3.1.0\modules\src\main\java\org\archive\modules\net\Robotstxt.java
For example, it supports the "Crawl-delay" directive, which crawler4j does not (a rough sketch of what parsing it could look like follows below).
It is under the Apache License 2.0, so you could probably bring it into crawler4j.
(Please verify the licensing yourself if you do.)
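
To make the suggestion concrete, here is a rough, hypothetical sketch of extracting a Crawl-delay value from a robots.txt body. This is not part of crawler4j's API; the method name and the simplified handling of user-agent groups are assumptions.

// Hypothetical helper: returns the Crawl-delay (in seconds) that applies to the
// given user agent, or null if none is present.
// Simplification: each User-agent line starts a new group; real parsers allow
// several consecutive User-agent lines to share one group of rules.
static Double parseCrawlDelay(String robotsTxt, String userAgent) {
    boolean groupMatches = false;
    for (String rawLine : robotsTxt.split("\r?\n")) {
        String line = rawLine;
        int hash = line.indexOf('#'); // strip comments
        if (hash >= 0) {
            line = line.substring(0, hash);
        }
        line = line.trim();
        if (line.isEmpty()) {
            continue;
        }
        String lower = line.toLowerCase();
        if (lower.startsWith("user-agent:")) {
            String agent = line.substring("user-agent:".length()).trim().toLowerCase();
            groupMatches = agent.equals("*") || userAgent.toLowerCase().contains(agent);
        } else if (groupMatches && lower.startsWith("crawl-delay:")) {
            try {
                return Double.parseDouble(line.substring("crawl-delay:".length()).trim());
            } catch (NumberFormatException e) {
                return null; // malformed value; ignore the directive
            }
        }
    }
    return null;
}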

Original comment by pikote...@gmail.com on 25 Feb 2012 at 9:32

GoogleCodeExporter commented 9 years ago
It was my mistake, I'm sorry.
The MIME type of my robots.txt was "text/html", not "text/plain".
robots.txt should be served as text/plain.

But it might still be good to consider having crawler4j accept other MIME types, especially "text/*":
http://www.nextthing.org/archives/2007/03/12/robotstxt-adventure
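
As an illustration of that suggestion (not existing crawler4j behavior), a lenient content-type check could look roughly like this; the helper name and the lenient flag are assumptions.

// Hypothetical check: decide whether a fetched robots.txt response should be
// parsed, based on its Content-Type header.
static boolean isAcceptableRobotsContentType(String contentType, boolean lenient) {
    if (contentType == null) {
        // Some servers omit the header; treat the body as plain text.
        return true;
    }
    String normalized = contentType.toLowerCase();
    // Match "text/plain" as well as "text/plain; charset=UTF-8".
    if (normalized.startsWith("text/plain")) {
        return true;
    }
    // Lenient mode: accept any text/* type, since many sites serve robots.txt
    // with a misconfigured type such as text/html (see the linked article).
    return lenient && normalized.startsWith("text/");
}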

Original comment by pikote...@gmail.com on 26 Feb 2012 at 3:32

GoogleCodeExporter commented 9 years ago
As you mentioned, it's by design.

-Yasser

Original comment by ganjisaffar@gmail.com on 28 Feb 2012 at 5:47