sawantuday / crawler4j

Automatically exported from code.google.com/p/crawler4j

Where is Crawled Data being stored after crawling ends #155

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I did a test run of the controller class, configured with maxDepthOfCrawling=2 
and numberOfCrawlers=2.

I expected the crawled data to show up quickly in crawlStorageFolder or 
elsewhere, but I cannot find it anywhere, even after the crawl ends. All I 
can find in crawlStorageFolder are two .lck files and one .jdb file, and I 
cannot open those either.
I want to use the data for my search application.

I am using version 3.3

Also, I cannot find the logger.info lines in the log file inside the 
.metadata folder.
I am using the Eclipse IDE.
I have not yet installed Oracle Berkeley DB. Is it required, or can I get the 
required information in a flat file?
Kindly reply.

Original issue reported on code.google.com by shweta.j...@gmail.com on 16 May 2012 at 7:31

GoogleCodeExporter commented 9 years ago
The crawler crawls the web; it doesn't store the web.

Berkeley DB stores only the crawler's internal URL data. It exists for 
internal crawler functionality, and its DB files are not meant to be read by 
the user.

There is no need to install any Berkeley DB utility; the crawler runs it on 
the fly.

If you want to store all of the pages the crawler visited, go to your 
crawler class and, in the visit(Page page) method, do the following:
if (page.getParseData() instanceof HtmlParseData) {
    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
    String text = htmlParseData.getText(); // plain text of the page
    String html = htmlParseData.getHtml(); // raw HTML of the page
}

That's it: html and text now hold the page content, and you can store them or 
do anything else you want with them.
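
For the flat-file case asked about above, here is a minimal sketch of one way to persist each page. PageStore is a hypothetical helper, not part of crawler4j; the one-file-per-URL layout and the SHA-256 file names are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of crawler4j): one flat file per crawled URL. */
final class PageStore {
    private final Path folder;

    PageStore(Path folder) {
        try {
            // Create the output folder (and any missing parents) up front.
            Files.createDirectories(folder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.folder = folder;
    }

    /** Maps a URL to a file name that is safe on any file system. */
    Path pathFor(String url) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return folder.resolve(hex + ".html");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Writes the page content as UTF-8, overwriting any earlier copy. */
    Path save(String url, String html) {
        Path file = pathFor(url);
        try {
            Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    /** Reads back the content previously saved for this URL. */
    String load(String url) {
        try {
            return new String(Files.readAllBytes(pathFor(url)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside visit(Page page), something like store.save(page.getWebURL().getURL(), html) would then write the page out, where store is a PageStore field you create yourself. Each crawler thread writes to a different file per URL, so no extra locking is sketched here.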

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:59

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:59