sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project!
Apache License 2.0

Different results from crawls with same settings #175

Closed benempson closed 6 years ago

benempson commented 6 years ago

Hi there, I've noticed that I get different results from running crawls on the same site with the same settings; generally, more pages are found on each subsequent crawl. The site does not have JavaScript-generated links and I'm using Abot only, not AbotX.

I ran some tests this morning. The results were as follows:

  1. 102 urls found in 1m44s
  2. 189 urls found in 1m42s
  3. 272 urls found in 1m12s
  4. 272 urls found in 30s
  5. 272 urls found in 18s
  6. 272 urls found in 20s
  7. Process was killed and restarted. 272 urls found in 25s.

The elapsed times also seem significant: the final three crawls were much shorter than the first three, which suggests some sort of cache. However, the last test still found 272 urls, and quickly, even after the process was killed and restarted, which argues against that.

Here's my code:

```csharp
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.HttpRequestTimeoutInSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = int.MaxValue;
crawlConfig.MaxCrawlDepth = int.MaxValue;
crawlConfig.UserAgentString = String.Format("xxx v{0}", GenUtil.AssemblyVersion());

PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig);
crawler.PageCrawlStarting += DoOnPageCrawlStarting;
crawler.PageCrawlCompleted += DoOnPageCrawlCompleted;
CrawlResult result = crawler.Crawl(new Uri(url));
```

This type of result is consistent; I have noticed many times that the first, and sometimes even the second, crawl doesn't find all the available urls. How can I ensure that Abot finds the same number of urls on each crawl? From the tests above it would seem we need to run Abot three times per domain to be sure we get all urls, but I can't believe that's really the case; something else must be happening, although I'm not clear on what it could be.
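To try to see where the missing urls go, I'm thinking of extending DoOnPageCrawlCompleted along these lines. This is just a sketch; the CrawledPage members used below are the ones shown in the Abot README (Uri, WebException, HttpWebResponse), so please correct me if they differ in 1.5:

```csharp
// Sketch only: log failures in the completion handler so missing urls show up.
// Assumes the Abot 1.x members shown in the README (e.CrawledPage.Uri,
// WebException, HttpWebResponse); adjust names if your version differs.
private void DoOnPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    bool failed = crawledPage.WebException != null
        || crawledPage.HttpWebResponse == null
        || crawledPage.HttpWebResponse.StatusCode != System.Net.HttpStatusCode.OK;

    if (failed)
    {
        string reason = crawledPage.WebException != null
            ? crawledPage.WebException.Message
            : (crawledPage.HttpWebResponse == null
                ? "no response"
                : crawledPage.HttpWebResponse.StatusCode.ToString());

        // Timed-out or errored pages land here and never show up in the url count.
        Console.WriteLine("FAILED {0} - {1}", crawledPage.Uri.AbsoluteUri, reason);
    }
}
```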

Here's the configuration from the log:

```
2017-11-30 09:05:54,572 [10] INFO AbotLogger [(null)] [(null)] - Configuration Values:
2017-11-30 09:05:54,572 [10] INFO AbotLogger [(null)] [(null)] - Abot Version: 1.5.1.69
2017-11-30 09:05:54,573 [10] INFO AbotLogger [(null)] [(null)] - MaxConcurrentThreads: 10
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - MaxPagesToCrawl: 2147483647
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - MaxPagesToCrawlPerDomain: 0
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - MaxPageSizeInBytes: 0
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - UserAgentString: ScreenShooter v1.0.0
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - CrawlTimeoutSeconds: 100
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - IsUriRecrawlingEnabled: False
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - IsExternalPageCrawlingEnabled: False
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - IsExternalPageLinksCrawlingEnabled: False
2017-11-30 09:05:54,576 [10] INFO AbotLogger [(null)] [(null)] - IsRespectUrlNamedAnchorOrHashbangEnabled: False
2017-11-30 09:05:54,576 [10] INFO AbotLogger [(null)] [(null)] - DownloadableContentTypes: text/html
2017-11-30 09:05:54,576 [10] INFO AbotLogger [(null)] [(null)] - HttpServicePointConnectionLimit: 200
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - HttpRequestTimeoutInSeconds: 100
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - HttpRequestMaxAutoRedirects: 7
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - IsHttpRequestAutoRedirectsEnabled: True
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - IsHttpRequestAutomaticDecompressionEnabled: False
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - IsSendingCookiesEnabled: False
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - IsSslCertificateValidationEnabled: True
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - MinAvailableMemoryRequiredInMb: 0
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - MaxMemoryUsageInMb: 0
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - MaxMemoryUsageCacheTimeInSeconds: 0
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - MaxCrawlDepth: 2147483647
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - MaxLinksPerPage: 0
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - IsForcedLinkParsingEnabled: False
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - MaxRetryCount: 0
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - MinRetryDelayInMilliseconds: 0
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - IsRespectRobotsDotTextEnabled: False
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - IsRespectMetaRobotsNoFollowEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - IsRespectHttpXRobotsTagHeaderNoFollowEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - IsRespectAnchorRelNoFollowEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - IsIgnoreRobotsDotTextIfRootDisallowedEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - RobotsDotTextUserAgentString: abot
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - MinCrawlDelayPerDomainMilliSeconds: 0
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - MaxRobotsDotTextCrawlDelayInSeconds: 5
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - IsAlwaysLogin: False
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - LoginUser:
2017-11-30 09:05:54,583 [10] INFO AbotLogger [(null)] [(null)] - LoginPassword:
```

sjdirect commented 6 years ago

Have you tried crawling other sites with the same configuration? Are those counts consistent? I have encountered many sites that dynamically change their links (on the back end and/or via front-end JS), which alters the final result. Site performance for the same site also doesn't necessarily stay constant.
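If it helps, a rough way to confirm whether the link set itself changes between runs is to collect the urls from each run and diff them. This is only a sketch, not anything built into Abot; the two sets are placeholders you would fill from e.CrawledPage.Uri inside each run's PageCrawlCompleted handler:

```csharp
// Rough sketch, not Abot API: collect the urls seen in each run and diff them.
// Populate each set from e.CrawledPage.Uri.AbsoluteUri inside the
// PageCrawlCompleted handler of the corresponding run.
using System;
using System.Collections.Generic;
using System.Linq;

var urlsFromRun1 = new HashSet<string>();   // filled during crawl #1
var urlsFromRun2 = new HashSet<string>();   // filled during crawl #2

// ... run the two crawls here ...

foreach (string url in urlsFromRun2.Except(urlsFromRun1))
    Console.WriteLine("Only seen in run 2: {0}", url);
```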

benempson commented 6 years ago

Hi, it seems to depend on the site: some are consistent, some aren't. We're currently thinking that Abot is hitting a timeout for some pages, and those pages are therefore not being recorded. As such, it doesn't look like an Abot problem, sorry!
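If it does turn out to be timeouts, one thing we may try is letting Abot retry failed requests. Just a sketch, using the MaxRetryCount / MinRetryDelayInMilliseconds settings that show up as 0 in the configuration dump above (the values here are arbitrary examples):

```csharp
// Sketch: allow a couple of retries for pages that fail (e.g. with a timeout).
// Both properties appear (as 0) in the configuration dump above.
crawlConfig.MaxRetryCount = 2;                  // retry a failed page up to twice
crawlConfig.MinRetryDelayInMilliseconds = 2000; // wait 2s between retries
```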

sjdirect commented 6 years ago

Thanks for the update.