mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Exception --> Can't open a cursor Database was closed #81

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I have a seed page that links to 2 HTML files (e.g. 1.html, 2.html). Each of
those files (1.html, 2.html) links to 1000 HTML files. The crawler iterates
through the 1000 links in 1.html, but fails to do so for 2.html, throwing the
exception shown below. I'm using 1 crawler.
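For orientation, here is a minimal sketch of the kind of controller setup described above, assuming the 2.x-era sample API (a CrawlController constructed with a storage-folder path) and a crawler class named MyCrawler; the seed URL and paths are placeholders, not taken from the report:

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    public class Controller {
        public static void main(String[] args) throws Exception {
            // Folder where crawler4j keeps its intermediate (BerkeleyDB) data
            CrawlController controller = new CrawlController("/data/crawl/root");
            // Seed page that links to 1.html and 2.html (placeholder URL)
            controller.addSeed("http://example.com/seed.html");
            // Blocks until the crawl finishes; the second argument is the
            // number of crawler threads -- 1 in the scenario described above
            controller.start(MyCrawler.class, 1);
        }
    }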

What is the expected output? What do you see instead?
It should crawl all 2000 URLs (those linked from 1.html and 2.html); instead it
only crawls the links from 1.html.

What version of the product are you using? On what operating system?
crawler4j - 2.6.1
Windows 7
jdk 6

Please provide any additional information below.

Here's the exception stacktrace.

Exception in thread "Crawler 2" java.lang.IllegalStateException: Can't open a 
cursor Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.openCursor(Database.java:619)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.get(WorkQueues.java:65)
    at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:98)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:92)
    at java.lang.Thread.run(Unknown Source)

Original issue reported on code.google.com by sham...@gmail.com on 19 Sep 2011 at 5:51

GoogleCodeExporter commented 9 years ago
Can you attach the source code of your controller class?

Original comment by ganjisaffar@gmail.com on 19 Sep 2011 at 6:43

GoogleCodeExporter commented 9 years ago
Thanks for looking into this. I've attached the controller and crawler code.
The controller is taken from your sample. FYI, I was playing with the settings
and decided to change the number of crawlers from 1 to 2. The error seems to
have gone away after that.

I'm not sure how the number-of-crawlers option works; I probably need to take a
look at your source code to understand it. What I observed was that crawler 1
only crawled the seed page, which references 1.html and 2.html, while crawler 2
crawled the 1000 links inside each of 1.html and 2.html. Is there any way to
control this, if I need a separate crawler thread to parse the links inside
1.html and 2.html respectively?
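If the controller follows the standard sample, the change described above presumably amounts to switching the last argument of start(); a hedged sketch, with MyCrawler and the exact signature assumed rather than taken from the (missing) attachments:

    // Before: a single crawler thread, which triggered the exception
    controller.start(MyCrawler.class, 1);

    // After: two crawler threads, after which the error no longer appeared
    controller.start(MyCrawler.class, 2);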

Original comment by sham...@gmail.com on 19 Sep 2011 at 7:20

Attachments:

GoogleCodeExporter commented 9 years ago
Your code seems fine to me. The only way I can imagine this problem happening
is if the controller thinks there are no more URLs and closes the database
while Crawler 2 is asking for new URLs. The expected behaviour is that all of
the crawlers terminate before the controller closes the database. Anyway, if
you check out the source code and debug it, you might find the exact problem.
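To illustrate the ordering being described (a generic sketch, not crawler4j's actual shutdown code; crawlerThreads and workQueuesDb are hypothetical names), the controller must wait for every crawler thread to finish before closing the shared BerkeleyDB store, otherwise a late getNextURLs() call tries to open a cursor on a closed database:

    import java.util.List;
    import com.sleepycat.je.Database;

    static void shutdown(List<Thread> crawlerThreads, Database workQueuesDb)
            throws InterruptedException {
        for (Thread crawlerThread : crawlerThreads) {
            crawlerThread.join();   // wait for each crawler to terminate first
        }
        workQueuesDb.close();       // only then is it safe to close the database
    }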

-Yasser

Original comment by ganjisaffar@gmail.com on 22 Sep 2011 at 4:14

GoogleCodeExporter commented 9 years ago
This issue should be resolved in version 3.0

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 7:26