mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Give developers the option of getting the URLs on a page themselves #141

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Under WebCrawler.processPage(WebURL curURL), the implementation currently collects 
all URLs on a page. The project I need crawler4j for requires that I only get 
certain URLs matching a given HTML selector. 

I've extended your code for my purposes, but it would be nice if there were a 
method, called from within WebCrawler, named something like 
getUrlsFromSource(String source). Maybe a few overrides would be nice too, but 
this would allow a bit more customization.

Of course it would default to grabbing all links, but accepting an HTML selector 
would be nice when a developer needs it. 

It could be up to the user how they want to collect the URLs if they override the 
getUrlsFromSource method.
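
To make the proposal concrete, here is a minimal sketch of what such a hook could look like; the class name, the exact signature, and the second parameter are illustrative, and nothing like this exists in crawler4j today:

import java.util.Set;

import edu.uci.ics.crawler4j.url.WebURL;

// Hypothetical sketch only: the idea is that WebCrawler.processPage(WebURL)
// would route discovered links through this hook before scheduling them.
public class ProposedHookSketch {

  // Default, as proposed: keep all links unchanged, so existing crawlers
  // keep their current behavior.
  protected Set<WebURL> getUrlsFromSource(String source, Set<WebURL> allLinks) {
    return allLinks;
  }
}

A subclass would then override getUrlsFromSource to parse the page source with an HTML-selector library and return only the matching links.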

Original issue reported on code.google.com by bsham...@gmail.com on 2 Apr 2012 at 3:17

GoogleCodeExporter commented 9 years ago
You can do this very simply.

In your crawler you are already overriding the visit(Page page) method.

In the Page object you have the complete HTML of the page:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

// Inside your WebCrawler subclass's visit(Page page) override:
if (page.getParseData() instanceof HtmlParseData) {
  HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
  String html = htmlParseData.getHtml(); // raw HTML source of the fetched page
}

Use that HTML string to parse the page (with jsoup, for example) and take whatever links you need.
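
For example, a minimal sketch using jsoup (assuming jsoup is on the classpath; the class name SelectorLinkExtractor and the selector string are illustrative, not crawler4j API):

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorLinkExtractor {
  // Return only the links under the given CSS selector, as absolute URLs.
  static List<String> extractLinks(String html, String baseUri, String selector) {
    Document doc = Jsoup.parse(html, baseUri); // baseUri resolves relative hrefs
    List<String> links = new ArrayList<>();
    for (Element a : doc.select(selector)) {
      links.add(a.absUrl("href"));
    }
    return links;
  }
}

Inside visit(Page page) you could then call extractLinks(html, page.getWebURL().getURL(), "div.content a[href]") to keep only the links you care about.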

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:36

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:36