lidoapps / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to make this a focused crawler? #119

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I want to build a focused crawler that uses a trained classifier to 
determine the relevance of retrieved pages, so that URLs are queued for 
crawling only if the classifier considers the page relevant. Which classes 
should I look at? 

Thanks!

Original issue reported on code.google.com by smsa...@gmail.com on 1 Feb 2012 at 6:54

GoogleCodeExporter commented 9 years ago
I'm using crawler4j version 3.1. 

Original comment by smsa...@gmail.com on 1 Feb 2012 at 7:03

GoogleCodeExporter commented 9 years ago
You just need to follow the basic example: 
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/

This example is a general crawler. If you want a focused crawler, you 
should implement the filtering logic in the shouldVisit method of your 
crawler class.

-Yasser

Original comment by ganjisaffar@gmail.com on 4 Feb 2012 at 11:37

GoogleCodeExporter commented 9 years ago
Hi,

Would you be able to give an example of the code to put in the shouldVisit 
method to get the following results:

Topic="Premier League" (i.e. I want to crawl all websites about that topic)
Only websites ending with "co.uk"

Also, does Crawler4j use any pagerank?

Many thanks,

Olivier

Original comment by mrkevind...@gmail.com on 2 Aug 2013 at 2:42
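
A minimal sketch of the kind of check that could be called from shouldVisit 
for the two requirements above. It is plain Java and does not depend on 
crawler4j. FocusedFilter, TOPIC_TERMS, isCoUkHost, looksOnTopic and 
shouldFollow are hypothetical names made up for this illustration, and the 
keyword match is only a stand-in for a trained classifier:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of crawler4j. */
final class FocusedFilter {

    // Assumed topic terms; a real focused crawler would replace this
    // keyword list with the score of a trained classifier.
    private static final List<String> TOPIC_TERMS =
            Arrays.asList("premier league", "premier-league", "premierleague");

    private FocusedFilter() {
    }

    /** Returns true if the URL parses and its host ends with ".co.uk". */
    static boolean isCoUkHost(String url) {
        if (url == null) {
            return false;
        }
        try {
            String host = new URI(url).getHost();
            return host != null
                    && host.toLowerCase(Locale.ROOT).endsWith(".co.uk");
        } catch (URISyntaxException e) {
            // Malformed URLs are simply not followed.
            return false;
        }
    }

    /** Returns true if the URL or the link's anchor text mentions the topic. */
    static boolean looksOnTopic(String url, String anchorText) {
        String text = ((url == null ? "" : url) + " "
                + (anchorText == null ? "" : anchorText))
                .toLowerCase(Locale.ROOT);
        for (String term : TOPIC_TERMS) {
            if (text.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** Combined decision: ".co.uk" host and an apparent topic match. */
    static boolean shouldFollow(String url, String anchorText) {
        return isCoUkHost(url) && looksOnTopic(url, anchorText);
    }
}
```

In a crawler class, shouldVisit could then return something like 
FocusedFilter.shouldFollow(url.getURL(), null), assuming your crawler4j 
version passes a WebURL with a getURL() method as in the basic example. Note 
that shouldVisit is asked about a link before that link's page has been 
downloaded, so only the URL (and possibly anchor text) is available there. A 
classifier that needs the page text would have to run on content that has 
already been fetched, which probably means restructuring this sketch.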