My BasicCrawlerController.java:
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j"); // storage for intermediate crawl data (path assumed)
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

controller.addSeed("http://www.parent.com/parent.html");
controller.setCustomData("http://www.parent.com/"); // domain prefix handed to the crawlers
int numberOfCrawlers = 7; // number of crawler threads (value assumed)
controller.start(BasicCrawler.class, numberOfCrawlers);
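Since the controller stores the domain prefix with setCustomData, here is a minimal sketch of reading it back inside the crawler, assuming crawler4j 3.x's getMyController() and getCustomData() methods:

import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class BasicCrawler extends WebCrawler {
    // Read the prefix the controller stored with setCustomData.
    // getMyController()/getCustomData() are assumed from the crawler4j 3.x API.
    private String domainPrefix() {
        return (String) getMyController().getCustomData();
    }
}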
My BasicCrawler.java:
import java.util.regex.Pattern;

private final static Pattern BINARY_FILES_EXTENSIONS =
    Pattern.compile(".*\\.(bmp|gif|jpe?g|png|tiff?|pdf|ico|xaml|pict|rif|pptx?|ps" +
        "|mid|mp2|mp3|mp4|wav|wma|au|aiff|flac|ogg|3gp|aac|amr|vox" +
        "|avi|mov|mpe?g|ra?m|m4v|smil|wm?v|swf|aaf|asf|flv|mkv" +
        "|zip|rar|gz|7z|ace|alz|apk|arc|arj|dmg|jar|lzip|lha)" +
        "(\\?.*)?$"); // also matches URLs with a query part (URL?q=...)
/**
 * You should implement this function to specify whether the given URL
 * should be crawled or not (based on your crawling logic).
 */
public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    System.out.println("checking BINARY_FILES_EXTENSIONS");
    return !BINARY_FILES_EXTENSIONS.matcher(href).matches()
            && href.startsWith("http://www.parent.com/");
}
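One thing worth checking as a possible cause: in crawler4j 3.5, as far as I can tell, the base class declares shouldVisit(WebURL url) with a single parameter; the two-argument shouldVisit(Page, WebURL) form was introduced in a later release. If so, the method above is an overload rather than an override, so the default shouldVisit (which returns true) is used and every discovered URL gets crawled. A minimal sketch against the assumed 3.5 signature, with @Override so the compiler flags any mismatch:

import edu.uci.ics.crawler4j.url.WebURL;

@Override
public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    // Skip binary files and anything outside www.parent.com
    return !BINARY_FILES_EXTENSIONS.matcher(href).matches()
            && href.startsWith("http://www.parent.com/");
}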
If the http://parent.com/parent.html page has an anchor to
http://www.parent.com/child.html, it should be crawled; if it has an anchor
like http://www.sub.com/sub.html, it should not be crawled. For me, however,
sub.com is being crawled as well.
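Note that the prefix check above also treats http://parent.com/ (without www) as out of scope even though it serves the seed page. Here is a sketch of filtering by host explicitly with java.net.URI; isParentDomain is a hypothetical helper that accepts both parent.com and www.parent.com:

import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: true only for the parent.com and www.parent.com hosts
private static boolean isParentDomain(String href) {
    try {
        String host = new URI(href).getHost();
        return host != null
                && (host.equals("parent.com") || host.equals("www.parent.com"));
    } catch (URISyntaxException e) {
        return false; // malformed URL: do not crawl
    }
}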
I am using crawler4j-3.5.jar and its dependencies. Please help me solve
this.
Original issue reported on code.google.com by kselva92 on 18 May 2015 at 12:43