nhattn / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Add IEnumerable<Uri> PageLinks #76

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Add IEnumerable<Uri> PageLinks so users can access them without reparsing.

Original issue reported on code.google.com by sjdir...@gmail.com on 5 Mar 2013 at 5:24

GoogleCodeExporter commented 9 years ago
Either add it to the CrawledPage object or create another object "ParsedPage" 
that will have the links and possibly a few other values that could be useful 
to users.

Original comment by sjdir...@gmail.com on 5 Mar 2013 at 5:30

GoogleCodeExporter commented 9 years ago
The list of URLs found on the page is not currently saved. Created issue 76 
to consider adding this to the CrawledPage object. In the meantime, the 
workaround is...

//register for the PageCrawlCompletedAsync event
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;

//the event handler (Where and Except need "using System.Linq;")
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    //re-run the default link parser on the page that was just crawled
    IEnumerable<Uri> allLinksOnPage = new HapHyperLinkParser().GetLinks(e.CrawledPage);
    //links whose authority (host and port) matches the crawl's root uri are internal
    IEnumerable<Uri> internalLinks = allLinksOnPage.Where(l => l.Authority == e.CrawlContext.RootUri.Authority);
    //everything else is external
    IEnumerable<Uri> externalLinks = allLinksOnPage.Except(internalLinks);
}

A few notes...
- This is the same method that Abot uses by default to parse links.
- Even though the parsing happens twice, the most expensive operation (loading 
  the HtmlAgilityPack HtmlDocument object) only happens the first time.
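
The internal/external split in the workaround needs only Uri.Authority and LINQ, so it can be tried without Abot. Here is a minimal self-contained sketch. The URLs, the rootUri variable and the LinkSplitSketch class are made up for illustration and stand in for e.CrawlContext.RootUri and the parser output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LinkSplitSketch
{
    static void Main()
    {
        // Hypothetical stand-in for e.CrawlContext.RootUri.
        Uri rootUri = new Uri("http://example.com/");

        // Hypothetical stand-in for the links returned by the parser.
        List<Uri> allLinksOnPage = new List<Uri>
        {
            new Uri("http://example.com/about"),
            new Uri("http://example.com/contact"),
            new Uri("http://other.example.org/page"),
        };

        // Internal links share the root uri's authority (host, plus port if non-default).
        List<Uri> internalLinks = allLinksOnPage
            .Where(l => l.Authority == rootUri.Authority)
            .ToList();

        // Except is a set difference, so the result also drops duplicate uris.
        List<Uri> externalLinks = allLinksOnPage.Except(internalLinks).ToList();

        Console.WriteLine("internal: " + internalLinks.Count); // internal: 2
        Console.WriteLine("external: " + externalLinks.Count); // external: 1
    }
}
```

Note that comparing on Authority treats "www.example.com" and "example.com" as different sites, so links to a www variant of the root would be counted as external here.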

Original comment by sjdir...@gmail.com on 5 Mar 2013 at 5:48

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 29 Mar 2013 at 5:12

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 9 Apr 2013 at 4:31

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 27 Apr 2013 at 7:28

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r497.

Original comment by sjdir...@gmail.com on 1 May 2013 at 11:42