Either add the list of parsed links to the CrawledPage object, or create another object, "ParsedPage",
that holds the links and possibly a few other values that could be useful
to users.
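For illustration, a rough sketch of what such a "ParsedPage" object could look like; the class and member names below are hypothetical and not part of Abot:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

//hypothetical shape only, not an actual Abot type
public class ParsedPage
{
    //the page the links were parsed from
    public Uri Uri { get; set; }

    //all hyperlinks found on the page
    public IEnumerable<Uri> Links { get; set; }

    //other potentially useful values, e.g. the already-loaded html document
    public HtmlDocument HtmlDocument { get; set; }
}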
Original comment by sjdir...@gmail.com
on 5 Mar 2013 at 5:30
The list of URLs found on the page is not currently saved. Created issue 76
to consider adding this to the CrawledPage object. In the meantime, the
workaround is...
//register for the PageCrawlCompletedAsync event
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;

//the event handler
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    //parse all the links out of the crawled page's html
    IEnumerable<Uri> allLinksOnPage = new HapHyperLinkParser().GetLinks(e.CrawledPage);

    //an "internal" link shares the root uri's authority (host and port)
    IEnumerable<Uri> internalLinks = allLinksOnPage.Where(l => l.Authority == e.CrawlContext.RootUri.Authority);
    IEnumerable<Uri> externalLinks = allLinksOnPage.Except(internalLinks);
}
A few notes...
- This is the exact method that Abot uses by default to parse links.
- Even though the parsing happens twice, the most expensive operation (loading
the HtmlAgilityPack HtmlDocument object) only happens the first time.
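For reference, a rough sketch of how the workaround above could be used to collect the links for an entire crawl. The PoliteWebCrawler instance, the start URL, and the ConcurrentBag are illustrative assumptions (the async event may be raised on multiple threads), and the using directives assume the Abot 1.x namespace layout:

using System;
using System.Collections.Concurrent;
using System.Linq;
using Abot.Core;      //HapHyperLinkParser (assumed namespace)
using Abot.Crawler;   //PoliteWebCrawler (assumed namespace)

//thread-safe collection, since the async event may fire concurrently
ConcurrentBag<Uri> discoveredLinks = new ConcurrentBag<Uri>();

PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.PageCrawlCompletedAsync += (sender, e) =>
{
    //same parsing as the handler shown above
    foreach (Uri link in new HapHyperLinkParser().GetLinks(e.CrawledPage))
        discoveredLinks.Add(link);
};

crawler.Crawl(new Uri("http://example.com/")); //blocks until the crawl completes

//de-duplicate after the crawl finishes
var uniqueLinks = discoveredLinks.Distinct().ToList();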
Original comment by sjdir...@gmail.com
on 5 Mar 2013 at 5:48
This issue was closed by revision r497.
Original comment by sjdir...@gmail.com
on 1 May 2013 at 11:42
Original issue reported on code.google.com by sjdir...@gmail.com
on 5 Mar 2013 at 5:24