jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Now Seeding Wordpress Hosted Websites #263

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Add a seed URL for any WordPress-hosted site (one pointed to its own domain)
2. For example, http://darcyconroy.net/ or http://indianexpress.com/
3. Start controller

What is the expected output? What do you see instead?

The controller stops silently without any error message. It doesn't even reach 
shouldVisit. 

What version of the product are you using?

Latest

Please provide any additional information below.

crawler4j has been around for years, so I wonder why nobody has noticed this so far. 
These sites are hosted on WordPress but are pointed to a separate domain name. I 
opened another issue recently, but have since found the probable cause of the problem. 

Original issue reported on code.google.com by akshay22...@gmail.com on 2 May 2014 at 6:15

GoogleCodeExporter commented 8 years ago
I think the WordPress site has some plugins that filter user agents, beyond 
what robots.txt alone does.

Enable logger output:
BasicConfigurator.configure();

Set the logger to WARN level:
Logger.getRootLogger().setLevel(Level.WARN);

I can say the crawling is blocked by the server. If you change the user-agent 
string to an empty string with the code below, it crawls the data.
config.setUserAgentString("");
Note: you can use your own name as well.

So I think this has nothing to do with crawler4j itself. crawler4j sets a default 
user-agent string, which I think is blocked or blacklisted by such plugins.

Original comment by jeba.ride@gmail.com on 8 May 2014 at 11:35

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:51

GoogleCodeExporter commented 8 years ago
It appears that wordpress.com has blocked our crawler, identified by its 
userAgent.

Somebody has probably crawled wordpress.com in the past; they saw it as an 
attack on their systems and blocked our userAgent.

This problem is easy to solve, though: just set your own custom userAgent and you 
can crawl any wordpress.com site you want.

How to do that?
config.setUserAgentString(""); // Set it with any string...
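
As a rough illustration of "any string", a custom user agent could identify the crawler and give the site owner a way to reach its operator. The sketch below only builds such a string; the `name/version (+URL)` layout is a common convention among crawlers, not something crawler4j requires, and `UserAgents`, `describe`, `examplebot`, and the example.com URL are all hypothetical names chosen for this example.

```java
public class UserAgents {

    /** Builds a descriptive user-agent string of the form "name/version (+contactUrl)". */
    static String describe(String name, String version, String contactUrl) {
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalArgumentException("name must not be blank");
        }
        if (version == null || version.trim().isEmpty()) {
            throw new IllegalArgumentException("version must not be blank");
        }
        if (contactUrl == null || contactUrl.trim().isEmpty()) {
            throw new IllegalArgumentException("contactUrl must not be blank");
        }
        return name.trim() + "/" + version.trim() + " (+" + contactUrl.trim() + ")";
    }

    public static void main(String[] args) {
        // Hypothetical crawler name and contact page.
        String ua = describe("examplebot", "1.0", "https://example.com/bot");
        System.out.println(ua);
        // With crawler4j, the result would then be handed to the config:
        // config.setUserAgentString(ua);
    }
}
```

Whether a given server accepts such a string still depends on that server's filtering rules.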

Original comment by avrah...@gmail.com on 20 Aug 2014 at 12:26

GoogleCodeExporter commented 8 years ago
No, setting
config.setUserAgentString("");
does not work for me.

Original comment by aman.dha...@gmail.com on 3 Nov 2014 at 4:12

GoogleCodeExporter commented 8 years ago
Set it to a known userAgent. Just search for known userAgents, such as these 
Firefox ones:
http://www.useragentstring.com/pages/Firefox/

So, for example, try this:
config.setUserAgentString("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0");
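
To check whether a server really treats user agents differently, independent of crawler4j, a small probe along these lines might help. It is only a sketch using the JDK's `HttpURLConnection`: it sends one GET per user-agent string and prints the HTTP status. The class and method names are hypothetical, and the crawler4j default string shown is an assumption about the version in use at the time, so check `CrawlConfig` in your own version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class UserAgentProbe {

    /** Sends one GET request with the given User-Agent header and returns the HTTP status code. */
    static int statusFor(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to one of the sites named in this issue; pass another URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://darcyconroy.net/";
        String[] agents = {
            // ASSUMPTION: believed to be the crawler4j default of that era; verify in CrawlConfig.
            "crawler4j (http://code.google.com/p/crawler4j/)",
            // The Firefox string suggested in this thread.
            "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0",
        };
        for (String ua : agents) {
            System.out.println(statusFor(url, ua) + "  <-  " + ua);
        }
    }
}
```

If the two status codes differ (for example, an error code for the first string and 200 for the second), that would be consistent with the server filtering on the user agent rather than a bug in the crawler.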

Original comment by avrah...@gmail.com on 3 Nov 2014 at 3:12