sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project!
Apache License 2.0

List of URLs vs. a site crawl #79

Closed: cbelden closed this issue 9 years ago

cbelden commented 9 years ago

Hello,

I would like to configure the abot crawler to index a list of URLs instead of actually crawling an entire site.

So far I've thought of the following:

    public class MyCrawlerWrapper
    {
        private IEnumerable<string> _urls { get; set; }

        public MyCrawlerWrapper(IEnumerable<string> urls) { this._urls = urls; }

        public void Crawl()
        {
            // Configure the crawler to ignore all links found on crawled pages
            var config = new CrawlConfiguration();
            config.MaxCrawlDepth = 0;

            var crawler = new PoliteWebCrawler(config);
            crawler.PageCrawlStartingAsync += crawler_PageCrawlStartingAsync;
            crawler.PageCrawlCompletedAsync += crawler_PageCrawlCompletedAsync;

            crawler.Crawl(new Uri("http://www.some-root.com"));
        }

        private void crawler_PageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            Console.WriteLine("Finished crawling: " + e.CrawledPage.Uri);
        }

        private void crawler_PageCrawlStartingAsync(object sender, PageCrawlStartingArgs e)
        {
            Console.WriteLine("Starting crawl: " + e.PageToCrawl.Uri);

            // Queue the URL list; note that this runs on every starting event
            e.CrawlContext.Scheduler.Add(this._urls.Select(u => new PageToCrawl(new Uri(u))));
        }
    }

This will only index the specified URLs (and the root); however, I'd like to avoid re-adding all of the URLs during each PageCrawlStartingAsync event. Do you know of a better way to do this?

One option would be to expose an event that gets fired immediately before any crawling occurs; I would then be able to add all of the URLs once at the beginning of the crawl. Any help or advice would be much appreciated!
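
In the meantime, here is a minimal sketch of a workaround with the existing events: guard the seeding so it only happens on the first PageCrawlStartingAsync event. The _seeded field is my own addition, not part of Abot, and I'm assuming the events can fire from multiple worker threads, hence the Interlocked guard.

    // Hypothetical guard field; not part of Abot.
    private int _seeded = 0;

    private void crawler_PageCrawlStartingAsync(object sender, PageCrawlStartingArgs e)
    {
        Console.WriteLine("Starting crawl: " + e.PageToCrawl.Uri);

        // Seed the scheduler only on the first event; Interlocked prevents
        // concurrent event handlers from racing past the flag.
        if (Interlocked.Exchange(ref _seeded, 1) == 1)
            return;

        e.CrawlContext.Scheduler.Add(this._urls.Select(u => new PageToCrawl(new Uri(u))));
    }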

Thanks

cbelden commented 9 years ago

Note: I've since updated this implementation to use a custom Scheduler that comes pre-queued with all of the URLs I wish to index.

Here's the updated Crawler class for reference:

    public class MyCrawler
    {
        private IEnumerable<string> _urls { get; set; }

        public MyCrawler(IEnumerable<string> urls) { this._urls = urls; }

        public void Crawl()
        {
            // Only want the crawler to index the provided URLs, so set
            // the MaxCrawlDepth to 0.
            var config = new CrawlConfiguration();
            config.MaxCrawlDepth = 0;

            // Create a custom scheduler that is pre-queued with the list
            // of URLs to crawl.
            var scheduler = new UrlScheduler(this._urls);

            var crawler = new PoliteWebCrawler(config, null, null, scheduler, null, null, null, null, null);

            // TODO: Use these events to perform work on content
            crawler.PageCrawlCompletedAsync += crawler_PageCrawlCompletedAsync;

            // Start crawl
            crawler.Crawl(new Uri("http://www.some-site.com"));
        }

        private void crawler_PageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            Console.WriteLine("Finished crawling: " + e.CrawledPage.Uri);
        }
    }
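
For reference, a hypothetical usage of the class above (the URLs are placeholders):

    // Hypothetical usage; replace the placeholder URLs with real ones.
    var urls = new[] { "http://www.some-site.com/page-1", "http://www.some-site.com/page-2" };
    new MyCrawler(urls).Crawl();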

LeMoussel commented 9 years ago

cbelden, interesting!
Can you detail the UrlScheduler class?

cbelden commented 9 years ago

Yep; here it is. It just subclasses Abot's Scheduler class and bootstraps the queue with a list of URLs.

    class UrlScheduler : Scheduler
    {
        /// <summary>
        /// Instantiates the scheduler with its queue pre-loaded with the given URLs.
        /// </summary>
        /// <param name="urls">The URLs to seed the crawl queue with.</param>
        public UrlScheduler(IEnumerable<string> urls)
            : base()
        {
            this.Add(urls.Select(url => new PageToCrawl(new Uri(url))));
        }
    }
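
For completeness, the snippets in this thread need roughly the following using directives. This is an assumption on my part; the namespaces below match the Abot 1.x layout and may differ in other versions.

    // Assumed namespaces (Abot 1.x layout; adjust for your Abot version)
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;   // only for the Interlocked sketch earlier
    using Abot.Core;          // Scheduler
    using Abot.Crawler;       // PoliteWebCrawler, PageCrawlStartingArgs, PageCrawlCompletedArgs
    using Abot.Poco;          // CrawlConfiguration, PageToCrawl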

radenkozec commented 7 years ago

@cbelden This way the crawler will crawl multiple URLs multi-threaded, right? I have tried to implement it using the code you attached, with no success. When I call Crawl, nothing happens. Help?

cbelden commented 7 years ago

Hi @radenkozec! Unfortunately I'm not too familiar with this code anymore, and I do not have access to my previous implementation.

I think I modified the default scheduler to come pre-loaded with a list of URLs (instead of performing a proper crawl, which recursively looks at the links from each crawled page). I think setting MaxCrawlDepth to something other than 0 might fix your problem (just guessing though :( ).
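
Another thing worth checking, though I'm only going from memory of Abot 1.x defaults here: a manually constructed PageToCrawl leaves IsInternal false, and the default crawl decision logic skips pages it considers external unless external crawling is enabled. Something along these lines might help:

    // Sketch based on assumed Abot 1.x defaults; verify against your version.
    var config = new CrawlConfiguration();
    config.MaxCrawlDepth = 0;

    // Pre-queued pages on other domains count as external and may be skipped
    // unless this is enabled...
    config.IsExternalPageCrawlingEnabled = true;

    // ...or mark each pre-queued page as internal when creating it:
    var pages = urls.Select(u => new PageToCrawl(new Uri(u)) { IsInternal = true });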

radenkozec commented 7 years ago

@cbelden Thanks. I managed to make it work.