opensangja / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Add config value for MaxPagesToCrawlPerDomain #51

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Consider adding a config value for MaxPagesToCrawlPerDomain.

Original issue reported on code.google.com by sjdir...@gmail.com on 5 Dec 2012 at 8:29

GoogleCodeExporter commented 9 years ago
1) Create a config value MaxPagesToCrawlPerDomain in the CrawlConfiguration.cs file and have .NET fill it from the config section (like the other properties in that class).
2) Extend CrawlDecisionMaker.cs.
3) Add a ConcurrentDictionary<string, int> that keeps track of the domains that have been crawled and the current count for each domain.
4) Override the ShouldCrawlPage method and have it add to and check the dictionary, so that no domain is crawled more than MaxPagesToCrawlPerDomain times.
5) Pass in your implementation:

    WebCrawler crawler = new WebCrawler(
        null,
        null,
        null,
        null,
        null,
        new YourCrawlDecisionMaker(),
        null);
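
Steps 3 and 4 could be sketched roughly as follows. This is only an illustration of the per-domain counting, written so it compiles without Abot: DomainPageLimiter and TryAdmit are hypothetical names, "domain" is assumed to mean the URI host, and the limit is applied literally (0 admits nothing).

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper, not part of Abot: counts pages admitted per domain.
public sealed class DomainPageLimiter
{
    private readonly int _maxPagesPerDomain;

    // Domain (host) -> number of pages admitted so far.
    private readonly ConcurrentDictionary<string, int> _pagesPerDomain =
        new ConcurrentDictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    public DomainPageLimiter(int maxPagesPerDomain)
    {
        if (maxPagesPerDomain < 0)
            throw new ArgumentOutOfRangeException(nameof(maxPagesPerDomain));
        _maxPagesPerDomain = maxPagesPerDomain;
    }

    // Returns true and counts the page if its domain is still under the limit.
    public bool TryAdmit(Uri pageUri)
    {
        string domain = pageUri.Host; // assumption: one counter per host name
        while (true)
        {
            int current = _pagesPerDomain.GetOrAdd(domain, 0);
            if (current >= _maxPagesPerDomain)
                return false;

            // Compare-and-swap: retry if another thread changed the count,
            // so concurrent crawl threads cannot overshoot the limit.
            if (_pagesPerDomain.TryUpdate(domain, current + 1, current))
                return true;
        }
    }

    // Number of pages admitted so far for the given domain.
    public int GetCount(string domain)
    {
        int count;
        return _pagesPerDomain.TryGetValue(domain, out count) ? count : 0;
    }
}
```

A ShouldCrawlPage override in YourCrawlDecisionMaker would then call TryAdmit with the page's URI after the base decision allows the crawl, and reject the page when it returns false. The exact ShouldCrawlPage signature depends on the Abot version, so that wiring is left out here.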

Original comment by sjdir...@gmail.com on 5 Dec 2012 at 8:56

GoogleCodeExporter commented 9 years ago
Be sure to update the forum at 
https://groups.google.com/forum/#!topic/abot-web-crawler/HFu0DUGN9eU

Original comment by sjdir...@gmail.com on 5 Dec 2012 at 9:06

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 10 Dec 2012 at 8:15