mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Exception While Crawling !! #15

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I start many concurrent web crawlers from a list of seed URLs using your library; I was very careful about synchronization issues.
2. Running the crawler on some machines produces a strange exception:
****************************************
Exception in thread "Crawler 8" java.lang.IllegalStateException: Can't open a cursor Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.openCursor(Database.java:619)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.get(WorkQueues.java:50)
    at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:74)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:72)
    at java.lang.Thread.run(Thread.java:636)
java.lang.IllegalStateException: Can't call Database.put Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.put(Database.java:1046)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.put(WorkQueues.java:100)
    at edu.uci.ics.crawler4j.frontier.Frontier.scheduleAll(Frontier.java:48)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:142)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:85)
    at java.lang.Thread.run(Thread.java:636)
(the same Database.put stack trace is repeated verbatim by four more crawler threads)
****************************************

What is the expected output? What do you see instead?
- The crawler is expected to crawl silently, without any exceptions.

What version of the product are you using? On what operating system?
- I am using the latest version, crawler4j-2.2, on both Linux (Ubuntu 10.04.1) and Windows XP SP3.

Please provide any additional information below.
- I am using wget to download the pages of a website completely, into a site-like directory structure.
- Sometimes, on the Windows XP SP3 installation that shares a machine with the Ubuntu one, the crawl works fine without any exceptions.
- I package the whole program as an executable JAR and run it from the command line.
- When running the executable JAR I raise the minimum and maximum heap sizes to 512m and 1024m respectively (launch command shown below).
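
For reference, the launch command would look roughly like this (the JAR name is hypothetical):

    java -Xms512m -Xmx1024m -jar mycrawler.jar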

Thanks

Original issue reported on code.google.com by hafez.khaled on 4 Oct 2010 at 10:12

GoogleCodeExporter commented 9 years ago
Hi,
Can you attach your controller and MyCrawler source files?

Thanks,
Yasser

Original comment by ganjisaffar@gmail.com on 4 Oct 2010 at 7:51

GoogleCodeExporter commented 9 years ago
Hey,

MyCrawler --> the crawler class that contains the visit and shouldVisit methods
MyCrawlerHandler --> the controller class that initiates MyCrawler (the crawler class)
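
For context, a minimal sketch of such a crawler class, assuming the crawler4j 2.x API (the domain filter is illustrative):

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        // Decide whether a discovered URL should be scheduled for crawling.
        public boolean shouldVisit(WebURL url) {
            // Illustrative filter: stay within a single site.
            return url.getURL().toLowerCase().startsWith("http://www.example.com/");
        }

        // Called for each page after it has been fetched and parsed.
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }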

Thanks

Original comment by hafez.khaled on 5 Oct 2010 at 6:41

GoogleCodeExporter commented 9 years ago
Dear Yasser,

I found the problem. I think there is some kind of limit on the number of crawlers that can run at the same time; it seems related to DB access (e.g. the number of users that can access the DB at once), and it appears to be limited to 15 users (that means 15 crawlers at the same time).

The problem is that I was running about 35 crawlers at the same time, and at some random point during the run (random because it depends on the timing of DB access, synchronized or not) the exception was thrown. So I reduced the number of crawlers to 15 and it now runs perfectly.
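
In code, the workaround is simply to cap the thread count passed to the controller. A minimal sketch, assuming the crawler4j 2.x API (the storage path and seed URL are placeholders):

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    public class MyCrawlerHandler {
        public static void main(String[] args) throws Exception {
            // Folder where crawler4j keeps its intermediate BerkeleyDB data.
            CrawlController controller = new CrawlController("/tmp/crawl-root");
            controller.addSeed("http://www.example.com/");
            // Keep the number of concurrent crawler threads at 15 or below.
            controller.start(MyCrawler.class, 15);
        }
    }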

Could you please modify "crawler4j.properties" to support setting the number of users that can access the DB at the same time?

Note: please test what I am saying and correct me if I am wrong.

Thanks

Original comment by hafez.khaled on 11 Oct 2010 at 8:52

GoogleCodeExporter commented 9 years ago

Original comment by ganjisaffar@gmail.com on 12 Mar 2011 at 12:03

GoogleCodeExporter commented 9 years ago
I had the same error. I fixed it by calling controller.shutdown() and controller.waitUntilFinish() at the end of each crawler cycle. (My crawler was running inside a while(true) loop; see the sketch below.)
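
A minimal sketch of that fix; newController() is a hypothetical helper standing in for the version-specific CrawlController setup, and the seed URL and thread count are placeholders:

    while (true) {
        // newController() is a placeholder for the version-specific
        // CrawlController construction (storage folder, fetcher, etc.).
        CrawlController controller = newController();
        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 15);
        // Shut down and wait for all crawler threads to finish before
        // the next cycle reopens the BerkeleyDB environment.
        controller.shutdown();
        controller.waitUntilFinish();
    }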

Original comment by corti.ri...@gmail.com on 9 Jul 2013 at 12:19