Not Visiting Certain Seed Urls

jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j


Not Visiting Certain Seed Urls #262

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Use the sample code available here: https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/

2. Add the seed URL http://indianexpress.com/

3. Run

What is the expected output? What do you see instead?
It is supposed to start crawling, but nothing is printed in the Eclipse console, 
not even an error message. I tried debugging and found that it doesn't even reach 
the shouldVisit method. It would be great if I could at least get an error message.

What version of the product are you using?
latest

Please provide any additional information below.
Also asked here: 
http://stackoverflow.com/questions/23413880/crawler4j-stops-silently

Original issue reported on code.google.com by akshay22...@gmail.com on 1 May 2014 at 6:21

GoogleCodeExporter commented 8 years ago

I think the WordPress site has plugins that filter requests by user agent, in 
addition to whatever robots.txt specifies.

Enable Logger output. 
BasicConfigurator.configure();

Set Logger to WARN Level.
Logger.getRootLogger().setLevel(Level.WARN);

With logging enabled, I can see that the crawl is blocked by the server. If you 
set the user-agent string to an empty string with the code below, it crawls the data:
config.setUserAgentString("");
Note that you can use your own name as well.

So I think this has nothing to do with crawler4j itself. Crawler4j sets a default 
user-agent string, which I think is blocked or blacklisted 
by such plugins.
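
One way to check this independently of crawler4j is to request the same URL twice and change only the User-Agent header. The sketch below does that with the JDK's built-in HTTP client (Java 11+). The class and method names are hypothetical, and the default user-agent value in main is an assumption about what crawler4j sent at the time, so check the value your version actually uses.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of crawler4j: compares how a server answers
// the same URL when only the User-Agent header differs.
class UserAgentProbe {

    // Builds a GET request that carries the given User-Agent header.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the HTTP status code; the body is discarded.
    static int statusFor(HttpClient client, String url, String userAgent)
            throws IOException, InterruptedException {
        HttpResponse<Void> response = client.send(
                buildRequest(url, userAgent),
                HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://indianexpress.com/"; // seed URL from the report
        // Assumption: the default crawler4j user agent of that era; verify for your version.
        String defaultAgent = "crawler4j (http://code.google.com/p/crawler4j/)";
        String customAgent = "my-crawler"; // placeholder, use your own identifier

        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        System.out.println(defaultAgent + " -> " + statusFor(client, url, defaultAgent));
        System.out.println(customAgent + " -> " + statusFor(client, url, customAgent));
    }
}
```

If the two status codes differ (for example an error status for the first and 200 for the second), that would be consistent with the server filtering on the user agent. Identical results would point to some other cause.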

Original comment by jeba.ride@gmail.com on 8 May 2014 at 11:32

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:50

Changed state: Accepted
Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

It appears that wordpress.com has blocked our crawler, identified by its 
userAgent.

Somebody has probably crawled wordpress.com with it in the past; they saw it as an 
attack on their systems and blocked our userAgent.

This problem is easy to solve, though: set your own custom userAgent and you 
should be able to crawl any wordpress.com site you want.

How to do that?
config.setUserAgentString(""); // Set it with any string...

Original comment by avrah...@gmail.com on 20 Aug 2014 at 12:26

Changed state: Fixed