mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Too many open files #118

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Looks like the crawler is not properly closing files somewhere in the code base.

I have been sequentially crawling a number of websites. A new CrawlController is 
created for each site as follows:

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
config.setUserAgentString(userAgent);
config.setFollowRedirects(true);
config.setIncludeBinaryContentInCrawling(false);
config.setResumableCrawling(false);
config.setMaxPagesToFetch(10000);

PageFetcher pageFetcher = new PageFetcher(config);

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, 
pageFetcher);

controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed(domain);     
controller.start(IndexCrawler.class, 1);

Every few days, crawler4j starts throwing the following errors:

com.sleepycat.je.EnvironmentFailureException: (JE 5.0.34) 
/opt/projects/tmp/edb88a8a-d7e4-4684-bc5a-40cad4f32c9a/frontier 
java.io.FileNotFoundException: /opt/projects//tmp/edb88
a8a-d7e4-4684-bc5a-40cad4f32c9a/frontier/je.lck (Too many open files) 
LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. 
Environment is invalid and must be closed.
    at com.sleepycat.je.log.FileManager.lockEnvironment(FileManager.java:2186)
    at com.sleepycat.je.log.FileManager.<init>(FileManager.java:370)
    at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:447)
    at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:376)
    at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:180)
    at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:246)
    at com.sleepycat.je.Environment.<init>(Environment.java:227)
    at com.sleepycat.je.Environment.<init>(Environment.java:170)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:106)
    at jobs.IndexJob.doJob(IndexJob.java:74)
    at play.jobs.Job.doJobWithResult(Job.java:50)
    at play.jobs.Job.call(Job.java:146)
    at Invocation.Job(Play!)
Caused by: java.io.FileNotFoundException: 
/opt/projects/tmp/edb88a8a-d7e4-4684-bc5a-40cad4f32c9a/frontier/je.lck (Too 
many open files)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at com.sleepycat.je.log.FileManager.lockEnvironment(FileManager.java:2152)
    ... 12 more

The total number of open files allowed by the system is relatively high:

[root@domU-12-31-39-00-DE-45 logs]# cat /proc/sys/fs/file-max
170194
[root@domU-12-31-39-00-DE-45 logs]# 
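Note that /proc/sys/fs/file-max is the system-wide cap; "Too many open files" is usually triggered by the much lower per-process descriptor limit, which is worth checking as well (standard shell commands, no assumptions about the host beyond a POSIX environment):

```shell
# System-wide cap on open file handles (what the output above shows)
cat /proc/sys/fs/file-max

# Per-process cap -- often only 1024 by default, and the limit that
# actually produces "Too many open files" in the JVM
ulimit -n
```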

I have to restart the application for crawler4j to work again. 

Original issue reported on code.google.com by tahs...@trademango.com on 30 Jan 2012 at 9:49

GoogleCodeExporter commented 9 years ago
I am using version: 3.1

Original comment by tahs...@trademango.com on 30 Jan 2012 at 9:53

GoogleCodeExporter commented 9 years ago
Why are you creating a new controller for each website? You don't need to do 
that. Only one controller is enough.

-Yasser

Original comment by ganjisaffar@gmail.com on 31 Jan 2012 at 6:53
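A sketch of the single-controller setup suggested above, reusing the configuration from the original report. The IndexCrawler class and the seed URLs are placeholders from that report, not part of the crawler4j API:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

import java.util.Arrays;
import java.util.List;

public class SingleControllerCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-root");
        config.setMaxPagesToFetch(10000);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        // One controller means one BerkeleyDB environment (one frontier
        // folder, one je.lck), instead of a fresh environment per site.
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        List<String> seeds = Arrays.asList(
                "http://example.com/", "http://example.org/");
        for (String seed : seeds) {
            controller.addSeed(seed);
        }

        controller.start(IndexCrawler.class, 1);
    }
}
```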

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I think I can keep a static instance of CrawlController around and reuse the 
controller. But what if I want to start two crawls, let's say in different 
threads. Will that be possible?

Original comment by tahs...@trademango.com on 31 Jan 2012 at 8:31

GoogleCodeExporter commented 9 years ago
Actually I cannot use one CrawlController as there is no removeSeed or similar 
method.

The thing is, every website requires certain custom data to be set on the 
controller.

controller.setCustomData(dataMap);

This custom data is different for each website, so I cannot add all the 
websites as seeds in one controller.

Original comment by tahs...@trademango.com on 31 Jan 2012 at 8:35

GoogleCodeExporter commented 9 years ago
customData can be anything. You can create a class for it and have a data 
structure for each website. So, this is not an issue.

-Yasser

Original comment by ganjisaffar@gmail.com on 31 Jan 2012 at 8:59
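One way to follow this suggestion is to map each site to its own settings object and pass the whole map through setCustomData. The SiteData class and the host keys below are hypothetical, purely for illustration; the crawler4j-specific lines are shown as comments since they need the library on the classpath:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-site settings holder; one instance per website.
class SiteData {
    final String label;
    SiteData(String label) { this.label = label; }
}

public class CustomDataExample {
    public static void main(String[] args) {
        // Key each site's settings by host, then hand the whole map to a
        // single controller instead of one controller per site.
        Map<String, SiteData> dataBySite = new HashMap<>();
        dataBySite.put("example.com", new SiteData("shop"));
        dataBySite.put("example.org", new SiteData("blog"));

        // controller.setCustomData(dataBySite);
        //
        // Inside the WebCrawler subclass, look up the entry for the page
        // currently being visited, e.g.:
        //   Map<String, SiteData> map =
        //       (Map<String, SiteData>) getMyController().getCustomData();
        //   String host = new java.net.URL(page.getWebURL().getURL()).getHost();
        //   SiteData data = map.get(host);

        System.out.println(dataBySite.get("example.com").label);
    }
}
```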

GoogleCodeExporter commented 9 years ago

Original comment by ganjisaffar@gmail.com on 4 Feb 2012 at 11:35