Closed GoogleCodeExporter closed 9 years ago
I am using the code from trunk, and you can do something like this:
public boolean shouldVisit(WebURL url) {
    // The controller's custom data holds the list of allowed URL prefixes.
    List<String> domainsToCrawl = (List<String>) this.getMyController().getCustomData();
    String href = url.getURL().toLowerCase();
    if (filters.matcher(href).matches()) {
        return false;
    }
    // Visit only URLs that start with one of the allowed prefixes.
    for (String domain : domainsToCrawl) {
        if (href.startsWith(domain)) {
            return true;
        }
    }
    return false;
}
And set up the controller something like this:
List<String> domainsToCrawl = new ArrayList<String>();
domainsToCrawl.add("http://www.example.com");
domainsToCrawl.add("http://www.example.net");
domainsToCrawl.add("http://www.example.org");
controller.setCustomData(domainsToCrawl);
controller.start(IndexCrawler.class, 1);
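The prefix check can also be tried outside the crawler. The sketch below is a minimal, self-contained illustration, not crawler4j code: the class `DomainFilter`, its `shouldVisit(String, List<String>)` method, and the file-extension pattern are assumed names made up for this example. It also shows one caveat of a plain `startsWith` test: a prefix without a trailing slash, such as `http://www.example.com`, also matches `http://www.example.com.evil.org/`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j: mirrors the shouldVisit logic above.
final class DomainFilter {
    // Assumed filter: skip common static resources (adjust to your needs).
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    static boolean shouldVisit(String url, List<String> domainsToCrawl) {
        String href = url.toLowerCase(Locale.ROOT);
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        // Accept the URL if it starts with any allowed prefix.
        for (String domain : domainsToCrawl) {
            if (href.startsWith(domain)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> loose = Arrays.asList("http://www.example.com");
        List<String> strict = Arrays.asList("http://www.example.com/");
        System.out.println(shouldVisit("http://www.example.com/page.html", loose));  // true
        System.out.println(shouldVisit("http://www.example.com/logo.png", loose));   // false
        System.out.println(shouldVisit("http://www.example.com.evil.org/", loose));  // true
        System.out.println(shouldVisit("http://www.example.com.evil.org/", strict)); // false
    }
}
```

Ending each prefix with `/` avoids the look-alike host match, at the cost that a URL written without the trailing slash (`http://www.example.com`) no longer matches.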
Original comment by tahs...@trademango.com
on 10 Jan 2012 at 7:36
Original comment by ganjisaffar@gmail.com
on 23 Jan 2012 at 12:15
Hi, I did it like this:
String[] urls = {"http://www.url1/", "http://www.url.co.uk/home/index.html", "http://www.url3.com/index.htm", "http://www.url4.com/"};
for (String url : urls) {
    CrawlConfig config = new CrawlConfig();
    /* set up your config (and the controller that uses it) here */
    controller.setCustomData(url);
    controller.addSeed(url);
    controller.start(MyCrawler.class, 1);
}
And in the shouldVisit method:
public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.startsWith(this.getMyController().getCustomData().toString());
}
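One thing to watch with this approach: when the seed is a full page URL such as `http://www.url.co.uk/home/index.html`, `startsWith` only accepts URLs that begin with that exact path, so other pages on the same host are rejected. A possible workaround, sketched below under the assumption that the seeds are absolute URLs without an explicit port, is to reduce each seed to `scheme://host/` before storing it as custom data. `SeedPrefix` is a hypothetical helper name, not part of crawler4j.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of crawler4j.
final class SeedPrefix {
    // Reduces a seed URL to "scheme://host/" so that every page on the host matches.
    // Assumes an absolute URL with a host and no explicit port.
    static String of(String seedUrl) {
        URI uri = URI.create(seedUrl);
        String prefix = uri.getScheme() + "://" + uri.getHost() + "/";
        return prefix.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String seed = "http://www.url.co.uk/home/index.html";
        String page = "http://www.url.co.uk/about.html";
        System.out.println(page.startsWith(seed));                // false
        System.out.println(page.startsWith(SeedPrefix.of(seed))); // true
    }
}
```

With such a helper, the loop would call `controller.setCustomData(SeedPrefix.of(url))` instead of passing the raw seed.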
I hope that helps.
Original comment by ju...@gmx.net
on 13 Jul 2014 at 1:50
Original comment by avrah...@gmail.com
on 11 Aug 2014 at 12:59
Original issue reported on code.google.com by
marek.hu...@gmail.com
on 17 Nov 2011 at 8:00