sjdirect / abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.
https://abotx.org

AbotX not respecting CrawlConfigurationX MaxPagesToCrawlPerDomain and MaxCrawlDepth #20

Closed. hillhousehold closed this issue 4 years ago

hillhousehold commented 4 years ago

I'm testing this, and the crawler has currently reached roughly page 55,000 and gone several levels deep on one of the three domains in my test crawl. The code I use to load the configuration is below. I load the settings from the app config XML and then override some of them in the method, so that each test crawl can be customized from user input. For this test, MaxPagesToCrawlPerDomain and MaxCrawlDepth are hard-coded to 1000 and 1 respectively. Am I doing something wrong?

    var config = AbotXConfigurationSectionHandler.LoadFromXml().Convert();
    config.CrawlTimeoutSeconds = timeoutMilliseconds / 1000;
    config.HttpRequestTimeoutInSeconds = timeoutMilliseconds / 1000;
    config.JavascriptRenderingWaitTimeInMilliseconds = timeoutMilliseconds;
    config.MaxCrawlDepth = 1; //set for testing only
    config.JavascriptRenderingWaitTimeInMilliseconds = javascriptTimeout;
    config.MaxPagesToCrawlPerDomain = 1000; //set for testing only
    ParallelImplementationOverride impls = new ParallelImplementationOverride(config);
    impls.SiteToCrawlProvider.AddSitesToCrawl(sites);
    ParallelCrawlerEngine crawlEngine = new ParallelCrawlerEngine(config, impls);

sjdirect commented 4 years ago

Can you add this line...

impls.WebCrawlerFactory = new WebCrawlerFactory(config); //!!!!!!!!!!!!!!!!!This is new!!!!!

before this one...

ParallelCrawlerEngine crawlEngine = new ParallelCrawlerEngine(config, impls);

Let me know if that solves your problem.

hillhousehold commented 4 years ago

Thank you, but it is still not working properly. I added the line of code, recompiled, and ran a test with 8 domains and MaxPagesToCrawlPerDomain set to 1000. It has now passed 15,000 pages crawled, when it shouldn't have gone past 8,000 (8 domains x 1000 pages each).

My counter is incremented each time the crawler fires the PageCrawlCompletedAsync() event.
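For anyone checking the same thing: a single global counter only shows that the total is too high, not which domain went over its limit. Below is a minimal sketch of a per-domain counter that is safe to call from parallel crawl threads. PerDomainPageCounter is a hypothetical helper, not part of Abot or AbotX, and grouping pages by Uri.Host is an assumption; the crawler itself may define a "domain" differently.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an Abot/AbotX type): counts completed pages per host.
public sealed class PerDomainPageCounter
{
    private readonly ConcurrentDictionary<string, int> _counts =
        new ConcurrentDictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    // Call once per completed page. Safe to call from several threads at once.
    // Returns the running total for that page's host.
    public int Record(Uri pageUri)
    {
        return _counts.AddOrUpdate(pageUri.Host, 1, (_, current) => current + 1);
    }

    // Total pages recorded across all hosts.
    public int Total => _counts.Values.Sum();

    // Hosts whose page count is strictly greater than the given limit.
    public IReadOnlyList<string> HostsOver(int limit)
    {
        return _counts.Where(kv => kv.Value > limit)
                      .Select(kv => kv.Key)
                      .OrderBy(host => host, StringComparer.OrdinalIgnoreCase)
                      .ToList();
    }
}
```

In the page-completed handler this would be called with the crawled page's Uri (in Abot that should be e.CrawledPage.Uri, but verify against the version in use). After the crawl, HostsOver(1000) lists any host that exceeded the configured limit.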

hillhousehold commented 4 years ago

Also, my project is running version 1.3.81 of AbotX and version 1.6.0.5 of Abot.

hillhousehold commented 4 years ago

I was able to get it working by using the SiteToCrawl.CrawlConfiguration property instead of a global crawl configuration for the crawl.

sjdirect commented 4 years ago

I've been unable to reproduce this issue. If this springs up again in version 2.0+, feel free to reopen.