jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Restricting crawler to crawl within the given page domain #342

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
My BasicCrawlerController.java

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.parent.com/parent.html");
controller.setCustomData("http://www.parent.com/");
controller.start(BasicCrawler.class, numberOfCrawlers);
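
For completeness, the snippet above references a config object and a numberOfCrawlers value that are not shown. A minimal sketch of those missing pieces, assuming a placeholder storage folder and thread count:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder folder for intermediate crawl data
int numberOfCrawlers = 7;                       // placeholder number of concurrent crawler threads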

My BasicCrawler.java

  private final static Pattern BINARY_FILES_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpe?g|png|tiff?|pdf|ico|xaml|pict|rif|pptx?|ps" +
        "|mid|mp2|mp3|mp4|wav|wma|au|aiff|flac|ogg|3gp|aac|amr|au|vox" +
        "|avi|mov|mpe?g|ra?m|m4v|smil|wm?v|swf|aaf|asf|flv|mkv" +
        "|zip|rar|gz|7z|aac|ace|alz|apk|arc|arj|dmg|jar|lzip|lha)" +
        "(\\?.*)?$"); // For url Query parts ( URL?q=... )

  /**
   * You should implement this function to specify whether the given url
   * should be crawled or not (based on your crawling logic).
   */
  public boolean shouldVisit(Page page, WebURL url) {
      String href = url.getURL().toLowerCase();
      System.out.println("checking BINARY_FILES_EXTENSIONS");
      return !BINARY_FILES_EXTENSIONS.matcher(href).matches() && href.startsWith("http://www.parent.com/");
  }
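
As a quick illustration, the pattern above also catches binary URLs that carry a query string (the URLs below are just made-up examples):

  // Made-up URLs to show what the pattern filters out vs. lets through.
  boolean skipped  = BINARY_FILES_EXTENSIONS.matcher("http://www.parent.com/logo.png?v=2").matches(); // true  -> skipped
  boolean eligible = BINARY_FILES_EXTENSIONS.matcher("http://www.parent.com/child.html").matches();   // false -> may be visited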

If the http://parent.com/parent.html page has an anchor pointing to http://www.parent.com/child.html, that link should be crawled; if it has an anchor like http://www.sub.com/sub.html, it should not be crawled. In my case, however, sub.com is being crawled as well.

I am using crawler4j-3.5.jar and its dependencies. Please help me solve this.
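
For reference, a minimal sketch of a domain-restricted shouldVisit is below. It assumes the crawler4j 3.5 API, where, as far as I recall, WebCrawler.shouldVisit takes only a WebURL parameter rather than (Page, WebURL); if the signature in BasicCrawler does not match the base class, the override is never called and the default behaviour (visit everything) applies. The parent.com domain string is taken from the example above.

  // Sketch only: meant to replace shouldVisit in BasicCrawler.java above,
  // assuming the crawler4j 3.5 signature shouldVisit(WebURL url).
  // @Override makes the compiler report an error if that assumption is wrong.
  @Override
  public boolean shouldVisit(WebURL url) {
      String href = url.getURL().toLowerCase();
      // Skip binary files and anything outside the seed domain.
      return !BINARY_FILES_EXTENSIONS.matcher(href).matches()
             && href.startsWith("http://www.parent.com/");
  }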

Original issue reported on code.google.com by kselva92 on 18 May 2015 at 12:43