Closed benempson closed 6 years ago
Have you tried crawling other sites with the same configuration? Are those counts consistent? I have encountered many sites that dynamically change their links (on the back end and/or in front-end JS), which alters the final result. Also, performance for the same site doesn't necessarily stay constant.
Hi, it seems to depend on the site: some do, some don't. We currently think that Abot is getting timeouts on some pages, so they are not being recorded. As such, it doesn't look like an Abot problem, sorry!
Thanks for the update.
Hi there, I've noticed that I get different results from crawling the same site with the same settings; generally, more pages are found on each subsequent crawl. The site does not have JavaScript-generated links, and I'm using Abot only, not AbotX.
I ran some tests this morning. The results were as follows:
The elapsed times also seem significant: the final 3 crawls were much shorter than the first 3, which seems to imply some sort of cache. However, the last test found 272 URLs in a short time even after the process was killed and restarted, which goes against that.
Here's my code:
This type of result is consistent: I have noticed many times that the first and maybe even the second crawl doesn't get all the available URLs. How can I ensure that Abot gets the same number of URLs on each crawl? From the above tests, we would seem to need to run Abot 3 times for each domain to be sure we get all URLs. I can't believe that is the case; something else must be happening, but I'm not clear on what that could be.
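One way to narrow this down is to save the list of crawled URLs from each run and diff the lists, so it is clear exactly which pages the shorter runs missed. Below is a minimal sketch in Python, independent of Abot. The helpers `diff_runs` and `flaky_urls` and the sample URLs are hypothetical, and it assumes each run's URLs have already been exported to a plain list.

```python
from collections import Counter


def diff_runs(runs):
    """Compare the URL sets of several crawl runs.

    `runs` maps a run label to an iterable of crawled URLs.
    Returns (all_urls, missing), where `missing` maps each run label
    to the sorted URLs seen in some other run but not in that one.
    """
    url_sets = {label: set(urls) for label, urls in runs.items()}
    all_urls = set().union(*url_sets.values())
    missing = {label: sorted(all_urls - urls) for label, urls in url_sets.items()}
    return all_urls, missing


def flaky_urls(runs):
    """Return the URLs that appear in at least one run but not in every run."""
    counts = Counter(url for urls in runs.values() for url in set(urls))
    return sorted(url for url, n in counts.items() if n < len(runs))


if __name__ == "__main__":
    # Hypothetical sample data; replace with the URLs recorded by each crawl.
    runs = {
        "run1": ["http://example.com/", "http://example.com/a"],
        "run2": ["http://example.com/", "http://example.com/a", "http://example.com/b"],
    }
    all_urls, missing = diff_runs(runs)
    for label, urls in missing.items():
        print(label, "missed", len(urls), "of", len(all_urls), "URLs:", urls)
    print("inconsistent across runs:", flaky_urls(runs))
```

If the same URLs are missing every time, the cause is more likely on the site side (slow or failing pages); if the missing set changes from run to run, timing or load is a more plausible suspect.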
Here's the configuration from the log:
`2017-11-30 09:05:54,572 [10] INFO AbotLogger [(null)] [(null)] - Configuration Values:
2017-11-30 09:05:54,572 [10] INFO AbotLogger [(null)] [(null)] - Abot Version: 1.5.1.69
2017-11-30 09:05:54,573 [10] INFO AbotLogger [(null)] [(null)] - MaxConcurrentThreads: 10
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - MaxPagesToCrawl: 2147483647
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - MaxPagesToCrawlPerDomain: 0
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - MaxPageSizeInBytes: 0
2017-11-30 09:05:54,574 [10] INFO AbotLogger [(null)] [(null)] - UserAgentString: ScreenShooter v1.0.0
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - CrawlTimeoutSeconds: 100
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - IsUriRecrawlingEnabled: False
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - IsExternalPageCrawlingEnabled: False
2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - IsExternalPageLinksCrawlingEnabled: False
2017-11-30 09:05:54,576 [10] INFO AbotLogger [(null)] [(null)] - IsRespectUrlNamedAnchorOrHashbangEnabled: False
2017-11-30 09:05:54,576 [10] INFO AbotLogger [(null)] [(null)] - DownloadableContentTypes: text/html
2017-11-30 09:05:54,576 [10] INFO AbotLogger [(null)] [(null)] - HttpServicePointConnectionLimit: 200
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - HttpRequestTimeoutInSeconds: 100
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - HttpRequestMaxAutoRedirects: 7
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - IsHttpRequestAutoRedirectsEnabled: True
2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - IsHttpRequestAutomaticDecompressionEnabled: False
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - IsSendingCookiesEnabled: False
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - IsSslCertificateValidationEnabled: True
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - MinAvailableMemoryRequiredInMb: 0
2017-11-30 09:05:54,578 [10] INFO AbotLogger [(null)] [(null)] - MaxMemoryUsageInMb: 0
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - MaxMemoryUsageCacheTimeInSeconds: 0
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - MaxCrawlDepth: 2147483647
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - MaxLinksPerPage: 0
2017-11-30 09:05:54,579 [10] INFO AbotLogger [(null)] [(null)] - IsForcedLinkParsingEnabled: False
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - MaxRetryCount: 0
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - MinRetryDelayInMilliseconds: 0
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - IsRespectRobotsDotTextEnabled: False
2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - IsRespectMetaRobotsNoFollowEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - IsRespectHttpXRobotsTagHeaderNoFollowEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - IsRespectAnchorRelNoFollowEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - IsIgnoreRobotsDotTextIfRootDisallowedEnabled: False
2017-11-30 09:05:54,581 [10] INFO AbotLogger [(null)] [(null)] - RobotsDotTextUserAgentString: abot
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - MinCrawlDelayPerDomainMilliSeconds: 0
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - MaxRobotsDotTextCrawlDelayInSeconds: 5
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - IsAlwaysLogin: False
2017-11-30 09:05:54,582 [10] INFO AbotLogger [(null)] [(null)] - LoginUser:
2017-11-30 09:05:54,583 [10] INFO AbotLogger [(null)] [(null)] - LoginPassword: `
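Since timeouts are the suspected cause, it may help to pull the timeout and retry settings out of a configuration dump like this one. Below is a rough sketch in Python, assuming the `timestamp [thread] INFO AbotLogger [(null)] [(null)] - Name: value` layout seen in the log. `parse_config` and `timeout_settings` are hypothetical helpers, not part of Abot, and the list of "interesting" settings is my own pick based on their names.

```python
import re

# One entry looks like:
#   2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - CrawlTimeoutSeconds: 100
# The lookahead ends a value at the next timestamp (or end of text), so
# entries pasted onto a single line are still split correctly.
ENTRY = re.compile(
    r"\[\(null\)\] - ([A-Za-z ]+):[ \t]*(.*?)\s*"
    r"(?=\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}|\Z)",
    re.S,
)

# Setting names copied from the log; selected because they look
# timeout- or retry-related (an assumption, not an official list).
OF_INTEREST = (
    "CrawlTimeoutSeconds",
    "HttpRequestTimeoutInSeconds",
    "MaxRetryCount",
    "MinRetryDelayInMilliseconds",
)


def parse_config(log_text):
    """Return a {setting name: raw string value} dict from the log dump."""
    return {key.strip(): value.strip("` ") for key, value in ENTRY.findall(log_text)}


def timeout_settings(config):
    """Keep only the timeout/retry settings that are present in `config`."""
    return {name: config[name] for name in OF_INTEREST if name in config}


if __name__ == "__main__":
    # Three entries taken from the dump in this thread.
    sample = (
        "2017-11-30 09:05:54,575 [10] INFO AbotLogger [(null)] [(null)] - CrawlTimeoutSeconds: 100\n"
        "2017-11-30 09:05:54,577 [10] INFO AbotLogger [(null)] [(null)] - HttpRequestTimeoutInSeconds: 100\n"
        "2017-11-30 09:05:54,580 [10] INFO AbotLogger [(null)] [(null)] - MaxRetryCount: 0\n"
    )
    for name, value in timeout_settings(parse_config(sample)).items():
        print(name, "=", value)
```

For the dump in this thread, the values are `CrawlTimeoutSeconds: 100`, `HttpRequestTimeoutInSeconds: 100`, `MaxRetryCount: 0` and `MinRetryDelayInMilliseconds: 0`. If I read the Abot documentation correctly, `CrawlTimeoutSeconds` bounds the whole crawl rather than a single request, and a `MaxRetryCount` of 0 means failed requests are not retried, so both seem worth checking against the timeout theory.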