Closed by tinonetic 3 years ago
Abot's PoliteWebCrawler alone can crawl multiple PAGES of a single site concurrently. AbotX's ParallelCrawlerEngine manages multiple instances of Abot's PoliteWebCrawler, effectively allowing you to crawl multiple SITES concurrently.
This example shows how to get the content of a crawled page:
var crawler = new PoliteWebCrawler(config);
crawler.PageCrawlCompleted += PageCrawlCompleted; // Several events available...
var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));

private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
    var rawPageText = e.CrawledPage.Content.Text;
}
Thank you for the info. Very helpful.
One last clarification. If I want to crawl only specific pages of a website, not the whole site, do I still have to make it crawl the entire site?
How do I direct it to crawl, say, paged content? For example:
www.mysite.com/puppies?page=1
www.mysite.com/puppies?page=2
www.mysite.com/puppies?page=3
www.mysite.com/puppies?page=4
www.mysite.com/puppies?page=5
...and I do not want it to crawl:
www.mysite.com/contact
www.mysite.com/puppies/blog
www.mysite.com/services
Thank you for your patience!
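One way to look at the paged-content question above is as a URL filter: allow /puppies URLs that carry a numeric page parameter and reject everything else. The sketch below is not taken from Abot; PagedUrlFilter and IsPuppiesListingPage are hypothetical names, and the rule (path exactly /puppies, page >= 1) is an assumption based on the URLs listed in the question.

```csharp
using System;

public static class PagedUrlFilter
{
    // Hypothetical helper (not part of Abot): returns true only for
    // /puppies URLs whose query string has a numeric "page" parameter.
    public static bool IsPuppiesListingPage(Uri uri)
    {
        // Rejects /contact and /services, and sub-paths such as /puppies/blog.
        var path = uri.AbsolutePath.TrimEnd('/');
        if (!string.Equals(path, "/puppies", StringComparison.OrdinalIgnoreCase))
            return false;

        // uri.Query includes the leading '?', for example "?page=3".
        var pairs = uri.Query.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var pair in pairs)
        {
            var parts = pair.Split(new[] { '=' }, 2);
            if (parts.Length == 2
                && parts[0] == "page"
                && int.TryParse(parts[1], out var pageNumber)
                && pageNumber >= 1)
            {
                return true;
            }
        }

        return false;
    }
}
```

Abot's README describes a ShouldCrawlPageDecisionMaker delegate on the crawler that returns a CrawlDecision for each page. A predicate like this could presumably be called from that delegate with pageToCrawl.Uri, returning a decision with Allow = false for rejected URLs. That wiring is left out here so the sketch stays runnable without the Abot package.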
Hi,
Thanks for the product!
Apologies for the many questions.
How would I crawl a single site with multiple pages in parallel? Do I need AbotX, or would Abot do? Do I need to loop through the list of sites if I can only do 3 at a time in the free version? Is it ideal to have this in a job that keeps track of runs? Also, it doesn't say in which part of the code I get the crawled data. Is it in crawlEngine.SiteCrawlCompleted, after the lock(crawlCounts){...} statement?
Example