seantanwh / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler doesn't crawl some pages. #270

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Use the basic crawler example.
2. Add one seed: http://melodicyoga.wordpress.com/links/
3. Call controller.start(MyCrawler.class, 1); the crawl doesn't start: the 
shouldVisit and visit methods are never triggered.

Original issue reported on code.google.com by dariusz....@gmail.com on 6 Aug 2014 at 8:10

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:51

GoogleCodeExporter commented 8 years ago
It appears that wordpress.com has blocked our crawler, which it identifies by 
its userAgent.

Somebody has probably crawled wordpress.com in the past; they saw it as an 
attack on their systems and blocked our userAgent.

This problem is easy to solve, though: set your own custom userAgent and you 
can crawl any wordpress.com site you want.

How to do that?
config.setUserAgentString(""); // Set it with any string...
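
To check whether the user agent really is what the site reacts to, it can help to send the same request outside the crawler with different User-Agent headers. The following is only a minimal sketch using the JDK's java.net.http client (Java 11+), not crawler4j code. The class name UserAgentCheck, the methods buildRequest and fetchStatus, and the value "my-crawler/1.0" are all made up for illustration, and which status code wordpress.com returns to a blocked agent is not something this thread establishes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of crawler4j.
public class UserAgentCheck {

    // Builds a GET request carrying the given User-Agent header.
    // Nothing is sent over the network here.
    public static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the HTTP status code.
    // The default client does not follow redirects, so a 3xx code
    // is reported as-is.
    public static int fetchStatus(HttpRequest request) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }
}
```

Calling fetchStatus(buildRequest("http://melodicyoga.wordpress.com/links/", "my-crawler/1.0")) once with the user agent your crawler currently sends and once with a custom one should show whether the responses differ. If they do, passing that same custom string to config.setUserAgentString(...) is the change suggested above.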

Original comment by avrah...@gmail.com on 20 Aug 2014 at 12:26

GoogleCodeExporter commented 8 years ago
not working...

Original comment by ayush.me...@gmail.com on 29 Nov 2014 at 2:09

GoogleCodeExporter commented 8 years ago
Hi Ayush,

Please post a specific scenario and tell me what is not working so I can test 
it.

If something doesn't work - I will fix it.

Original comment by avrah...@gmail.com on 30 Nov 2014 at 8:42

GoogleCodeExporter commented 8 years ago
No scenario - tagged as invalid

Original comment by avrah...@gmail.com on 22 Jan 2015 at 11:43