Closed — cbelden closed this issue 9 years ago
Note: I've since updated this implementation to use a custom Scheduler that comes pre-queued with all of the URLs I wish to index.
Here's the updated Crawler class for reference:
public class MyCrawler
{
    private readonly IEnumerable<string> _urls;

    public MyCrawler(IEnumerable<string> urls)
    {
        this._urls = urls;
    }

    public void Crawl()
    {
        // Only want the crawler to index the provided URLs, so set
        // the MaxCrawlDepth to 0.
        var config = new CrawlConfiguration();
        config.MaxCrawlDepth = 0;

        // Create a custom scheduler that is pre-queued with the list
        // of URLs to crawl.
        var scheduler = new UrlScheduler(this._urls);
        var crawler = new PoliteWebCrawler(config, null, null, scheduler, null, null, null, null, null);

        // TODO: Use these events to perform work on content
        crawler.PageCrawlCompletedAsync += crawler_PageCrawlCompletedAsync;

        // Start crawl
        crawler.Crawl(new Uri("http://www.some-site.com"));
    }

    private void crawler_PageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
    {
        Console.WriteLine("Finished crawling: " + e.CrawledPage.Uri);
    }
}
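For context, usage might look roughly like the sketch below (the URL list and page names are placeholders, not from the original thread):

```csharp
using System.Collections.Generic;

// Hypothetical driver: seed the crawler with a fixed list of URLs.
// The URLs below are placeholders for illustration only.
var urls = new List<string>
{
    "http://www.some-site.com/page-1",
    "http://www.some-site.com/page-2"
};

var crawler = new MyCrawler(urls);
crawler.Crawl(); // PageCrawlCompletedAsync fires once per crawled page
```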
@cbelden Interesting. Can you detail the UrlScheduler class?
Yep; here it is. It just subclasses Abot's Scheduler class and bootstraps the queue with the list of URLs.
class UrlScheduler : Scheduler
{
    /// <summary>
    /// Instantiate the URL queue with list of URLs.
    /// </summary>
    /// <param name="urls"></param>
    public UrlScheduler(IEnumerable<string> urls)
        : base()
    {
        this.Add(urls.Select(url => new PageToCrawl(new Uri(url))));
    }
}
@cbelden This way the crawler will crawl the multiple URLs multi-threaded, right? I have tried to implement it using the code you attached, with no success. When I call Crawl, nothing happens. Help?
Hi @radenkozec! Unfortunately I'm not too familiar with this code anymore, and I do not have access to my previous implementation.
I think I modified the default scheduler to come pre-loaded with a list of URLs (instead of performing a proper crawl, which recursively follows the links from each crawled page). I think setting MaxCrawlDepth to something other than 0 might fix your problem (just guessing though :( ).
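If following discovered links is what's wanted, the depth tweak being suggested would just be a different MaxCrawlDepth value (the value here is chosen for illustration):

```csharp
var config = new CrawlConfiguration();
config.MaxCrawlDepth = 1; // 0 = only the pre-queued URLs; > 0 lets Abot follow links it finds
```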
@cbelden Thanks. I managed to make it work.
Hello,
I would like to configure the abot crawler to index a list of URLs instead of actually crawling an entire site.
So far I've thought of the following:
This will only index the specified URLs (and the root); however, I'd like to avoid attempting to re-add all of the URLs during each PageCrawlCompletedAsync event. Do you know of a better way to do this?
One option would be to expose an event that gets fired immediately before any crawling occurs; I would then be able to add all of the URLs once at the beginning of the crawl. Any help or advice would be much appreciated!
Thanks
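The snippet this question originally referred to with "the following" is not preserved above. A hypothetical reconstruction of the per-event re-adding approach being described — using only the Abot members that appear elsewhere in this thread (Scheduler.Add, PageCrawlCompletedAsync, MaxCrawlDepth) — might look roughly like this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch, not the poster's actual code: re-add the URL
// list from the PageCrawlCompletedAsync handler so the crawler keeps
// indexing only the specified pages.
var urls = new List<string>
{
    "http://www.some-site.com/page-1", // placeholder URLs
    "http://www.some-site.com/page-2"
};

var config = new CrawlConfiguration();
config.MaxCrawlDepth = 0;

var scheduler = new Scheduler();
var crawler = new PoliteWebCrawler(config, null, null, scheduler, null, null, null, null, null);

crawler.PageCrawlCompletedAsync += (sender, e) =>
{
    // The awkward part the question wants to avoid: the full URL
    // list is re-added on every completed page.
    scheduler.Add(urls.Select(url => new PageToCrawl(new Uri(url))));
};

crawler.Crawl(new Uri("http://www.some-site.com"));
```

The pre-queued UrlScheduler shown earlier in this thread sidesteps this by adding the URLs once, before the crawl starts.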