The crawler crawls the web; it doesn't store the web.
The BerkeleyDB stores only internal URLs. It exists for the crawler's internal
functionality, and its DB files are not meant to be read by the user.
There is no need to install any BerkeleyDB utility; the crawler runs it on the
fly.
If you want to store all of the pages the crawler crawled, go to your
Crawler class and do the following in the visit(Page page) method:
if (page.getParseData() instanceof HtmlParseData) {
    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
    String text = htmlParseData.getText(); // plain-text content of the page
    String html = htmlParseData.getHtml(); // raw HTML of the page
}
That's it: html and text now contain your page, and you can store them or do
anything else you want with them.
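
For illustration, here is a minimal sketch of a complete crawler class that
writes each crawled page's HTML to disk. The class name, output directory, and
file-naming scheme are assumptions for this example, not part of crawler4j
itself:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class StoringCrawler extends WebCrawler {

    // Hypothetical output directory; adjust to your environment.
    private static final Path OUTPUT_DIR = Paths.get("crawled-pages");

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String html = htmlParseData.getHtml();

            // One file per page, named after the page's docid.
            String fileName = page.getWebURL().getDocid() + ".html";
            try {
                Files.createDirectories(OUTPUT_DIR);
                Files.write(OUTPUT_DIR.resolve(fileName),
                        html.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                System.err.println("Could not store "
                        + page.getWebURL().getURL() + ": " + e.getMessage());
            }
        }
    }
}

crawler4j assigns each discovered URL a unique docid, so using it as the file
name avoids collisions between pages.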
Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:59
Not a bug or feature request
Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:59
Original issue reported on code.google.com by shweta.j...@gmail.com on 16 May 2012 at 7:31