sjdirect / abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project!
https://abotx.org

How would I crawl a single site with multiple pages in parallel? #24

Closed tinonetic closed 3 years ago

tinonetic commented 3 years ago

Hi,

Thanks for the product!

Apologies for the many questions.

How would I crawl a single site with multiple pages in parallel? Do I need AbotX, or would Abot do? Do I need to loop through the list of sites if I can only do 3 at a time on the free version? Is it ideal to have this in a job that keeps track of runs? Also, it doesn't say in which part of the code I get the crawled data... is it in crawlEngine.SiteCrawlCompleted, after the lock (crawlCounts) { ... } statement?

Example

        private static async Task DemoParallelCrawlerEngine()
        {
            var siteToCrawlProvider = new SiteToCrawlProvider();
            siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
            {
                new SiteToCrawl{ Uri = new Uri("YOURSITE1") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE2") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE3") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE4") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE5") }
            });

            var config = GetSafeConfig();
            config.MaxConcurrentSiteCrawls = 3;

            var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );                

            var crawlCounts = new Dictionary<Guid, int>();
            var siteStartingEvents = 0;
            var allSitesCompletedEvents = 0;
            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
            {
                var crawlId = Guid.NewGuid();
                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;
            };
            crawlEngine.SiteCrawlStarting += (sender, args) =>
            {
                Interlocked.Increment(ref siteStartingEvents);
            };
            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
            {
                lock (crawlCounts)
                {
                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);
                }
            };
            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
            {
                Interlocked.Increment(ref allSitesCompletedEvents);
            };

            await crawlEngine.StartAsync();
        }
sjdirect commented 3 years ago

Abot's PoliteWebCrawler alone can crawl multiple PAGES of a single site concurrently. AbotX's ParallelCrawlerEngine manages multiple PoliteWebCrawler instances, effectively allowing you to crawl multiple SITES concurrently.
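So for the single-site case in the original question, plain Abot should be enough. A minimal sketch (assuming Abot 2.x; the URL and the specific config values are illustrative, not recommendations):

```csharp
using System;
using System.Threading.Tasks;
using Abot2.Crawler;
using Abot2.Poco;

public class SingleSiteParallelDemo
{
    public static async Task Main()
    {
        // MaxConcurrentThreads controls how many PAGES of the one site
        // are fetched in parallel; no AbotX needed for this scenario.
        var config = new CrawlConfiguration
        {
            MaxConcurrentThreads = 10,  // parallel page requests within one site
            MaxPagesToCrawl = 100
        };

        var crawler = new PoliteWebCrawler(config);
        crawler.PageCrawlCompleted += (sender, e) =>
            Console.WriteLine($"{e.CrawledPage.Uri} -> {e.CrawledPage.HttpResponseMessage?.StatusCode}");

        await crawler.CrawlAsync(new Uri("https://example.com")); // hypothetical site
    }
}
```

Note that politeness settings such as a per-domain crawl delay can effectively serialize requests to the same host, so the observed parallelism may be lower than MaxConcurrentThreads.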

This example shows how to get the content of a crawled page:

    var crawler = new PoliteWebCrawler(config);
    crawler.PageCrawlCompleted += PageCrawlCompleted; // Several events available...
    var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));

    private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
        var rawPageText = e.CrawledPage.Content.Text;
    }
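In the ParallelCrawlerEngine setup from the original question, the same page-level event can be hooked on each crawler as it is created. A sketch, assuming the CrawlerInstanceCreated args expose the crawler as in the question's snippet:

```csharp
// Inside DemoParallelCrawlerEngine(), after constructing crawlEngine:
crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
{
    // Each site gets its own PoliteWebCrawler; subscribe to its
    // page-level events here to receive the crawled data per page.
    eventArgs.Crawler.PageCrawlCompleted += (s, e) =>
    {
        var status = e.CrawledPage.HttpResponseMessage?.StatusCode;
        var rawPageText = e.CrawledPage.Content?.Text;
        // Persist or process rawPageText here.
    };
};
```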
tinonetic commented 3 years ago

Thank you for the info. Very helpful.

One last clarification. If I want to crawl only specific pages of a website, not the whole site, do I have to make it crawl the entire site?

How do I direct it to crawl, say paged content. For example:

www.mysite.com/puppies?page=1
www.mysite.com/puppies?page=2
www.mysite.com/puppies?page=3
www.mysite.com/puppies?page=4
www.mysite.com/puppies?page=5

...and I do not want it to crawl

www.mysite.com/contact
www.mysite.com/puppies/blog
www.mysite.com/services

Thank you for your patience!
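For restricting a crawl to specific URLs like the ones above, Abot exposes a ShouldCrawlPageDecisionMaker hook on the crawler that can veto pages before they are fetched. A hedged sketch (the /puppies path and query-string check are assumptions matching the example URLs, not part of the thread's answer):

```csharp
using Abot2.Crawler;
using Abot2.Poco;

// Only pages whose path starts with /puppies and that carry a "page="
// query string are allowed; everything else is skipped. The seed page
// is still allowed so links can be discovered at all.
var crawler = new PoliteWebCrawler(new CrawlConfiguration());
crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) =>
{
    var uri = pageToCrawl.Uri;
    var wanted = uri.AbsolutePath.StartsWith("/puppies")
                 && uri.Query.Contains("page=");
    return new CrawlDecision
    {
        Allow = wanted || pageToCrawl.IsRoot,
        Reason = wanted ? "matches target pattern" : "outside target pages"
    };
};
```

Alternatively, if the target URLs are fully known up front (as with the numbered pages here), it may be simpler to skip link discovery entirely and request each URL directly.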