smasher125354 / crawler4j

Automatically exported from code.google.com/p/crawler4j

Add a filtering class to handle more easily URL filtering #220

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
This is on my wishlist :-)

Please, could you include these two classes in your engine to ease the URL 
filtering process? I took them from the Nutch package and changed them a bit to fit 
my needs (initially it was a Nutch plugin; now it is standalone).

You can filter URLs with regexps, prefixing each regexp with '+' or '-' to include or exclude the URLs it matches.
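
To illustrate the '+'/'-' convention, here is a minimal sketch. It is not the attached RegexURLFilter: the class name `SimpleUrlFilter` and the `example.com` rules are hypothetical. It also assumes that rules are tried in order, that the first regexp found in the URL decides, and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the attached classes: rules are tried in order,
// and the first pattern found in the URL decides whether it is kept.
class SimpleUrlFilter {

    // One rule: the sign says include ('+') or exclude ('-'),
    // the pattern says which URLs the rule applies to.
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each line is "+regexp" or "-regexp"; blank lines and '#' comments are skipped.
    SimpleUrlFilter(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // Returns true if the URL should be crawled.
    boolean accept(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false; // assumption: URLs that match no rule are rejected
    }

    public static void main(String[] args) {
        SimpleUrlFilter filter = new SimpleUrlFilter(Arrays.asList(
                "# skip images",
                "-\\.(gif|jpg|png)$",
                "+^https?://example\\.com/"));
        System.out.println(filter.accept("http://example.com/index.html")); // true
        System.out.println(filter.accept("http://example.com/logo.png"));   // false
        System.out.println(filter.accept("http://other.org/"));             // false
    }
}
```

Because the first matching rule decides, the order of the rules matters: putting the '-' image rule before the '+' host rule is what keeps `logo.png` out.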

Files:
- "regex-urlfilter.crawl.txt": an example rule file I use during crawling;
- "RegexRule.java" and "RegexURLFilter.java": the two main classes;
- "SampleCrawler.java": the sample crawler.

I hope it will help.

Regards,
Emmanuel

Original issue reported on code.google.com by zygolech...@gmail.com on 13 May 2013 at 6:54

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:39