Add IEnumerable<Uri> PageLinks

GoogleCodeExporter commented 9 years ago

Add IEnumerable<Uri> PageLinks so users can access them without reparsing.

Original issue reported on code.google.com by sjdir...@gmail.com on 5 Mar 2013 at 5:24

GoogleCodeExporter commented 9 years ago

Either add it to the CrawledPage object or create another object "ParsedPage" 
that will have the links and possibly a few other values that could be useful 
to users.
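
A rough sketch of what the second option might look like, purely for discussion. The ParsedPage class, its constructor, and the InternalLinks/ExternalLinks helpers are hypothetical names and are not part of Abot; the sketch also assumes every link is an absolute Uri, since Uri.Authority throws on relative ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for links found on a crawled page (not an Abot type).
public sealed class ParsedPage
{
    public ParsedPage(Uri pageUri, IEnumerable<Uri> pageLinks)
    {
        if (pageUri == null) throw new ArgumentNullException("pageUri");
        if (pageLinks == null) throw new ArgumentNullException("pageLinks");

        PageUri = pageUri;
        // Copy into a list once so callers can enumerate repeatedly without reparsing.
        PageLinks = pageLinks.ToList();
    }

    // The page these links were found on.
    public Uri PageUri { get; private set; }

    // Every link found on the page, in the order it was supplied.
    public IEnumerable<Uri> PageLinks { get; private set; }

    // Links whose authority (host and port) matches the crawl root.
    public IEnumerable<Uri> InternalLinks(Uri rootUri)
    {
        return PageLinks.Where(l => l.Authority == rootUri.Authority);
    }

    // Links that point at any other authority.
    public IEnumerable<Uri> ExternalLinks(Uri rootUri)
    {
        return PageLinks.Where(l => l.Authority != rootUri.Authority);
    }
}
```

A caller would build one of these from whatever link list the parser already produced and then read PageLinks directly, instead of running the parser again.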

Original comment by sjdir...@gmail.com on 5 Mar 2013 at 5:30

GoogleCodeExporter commented 9 years ago

The list of URLs found on the page is not currently saved. Created issue 76 
to consider adding this to the CrawledPage object. In the meantime, the 
workaround is:

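//register for the PageCrawlCompletedAsync event (the Where/Except calls below need "using System.Linq;")
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;

//the event method
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    //parse the links out of the page that was just crawled
    IEnumerable<Uri> allLinksOnPage = new HapHyperLinkParser().GetLinks(e.CrawledPage);
    //internal links share the authority (host and port) of the crawl's root uri
    IEnumerable<Uri> internalLinks = allLinksOnPage.Where(l => l.Authority == e.CrawlContext.RootUri.Authority);
    //everything else is external (Except also drops duplicate uris)
    IEnumerable<Uri> externalLinks = allLinksOnPage.Except(internalLinks);
}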

A few notes:
- This is the same method that Abot uses by default to parse links.
- Even though the parsing happens twice, the most expensive operation (loading 
  the HtmlAgilityPack HtmlDocument object) only happens the first time.

Original comment by sjdir...@gmail.com on 5 Mar 2013 at 5:48

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 29 Mar 2013 at 5:12

Added labels: Milestone-Release1.1

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 9 Apr 2013 at 4:31

Removed labels: Milestone-Release1.1

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 27 Apr 2013 at 7:28

Added labels: Milestone-Release1.2

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r497.

Original comment by sjdir...@gmail.com on 1 May 2013 at 11:42

Changed state: Fixed
