mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Resume Crawl - Enhancement #17

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Start a crawl that is going to take 8 weeks.
2. The crawl terminates unexpectedly after 6 weeks.
3. Start a new crawl: 8 more weeks (maybe)!

What is the expected output? What do you see instead?
I would expect the crawler to reopen the databases, continue through the work
queues, and not schedule a visit if the docID already exists. Currently I must
start over and revisit all pages.

What version of the product are you using? On what operating system?
c4j 2.2 / Mac OS X Server 10.6.4

Please provide any additional information below.
I'm trying to implement this but don't have any experience with BerkeleyDB.
I've removed the home folder deletion and local data purge from the
constructor and start methods of the controller. I've also turned off
deferredWrite for the DocIDsDB, but it reports a count of 0 after I terminate
and rerun the program.

Is this a trivial change to implement? If so, can you point me in the right
direction? If not, can you help me understand what the challenges are? I'm
going to keep trying, but any pointers would be appreciated.

Thanks!

Original issue reported on code.google.com by EvanDavi...@gmail.com on 7 Oct 2010 at 5:49

GoogleCodeExporter commented 9 years ago
Update: I implemented transactions and it seems to be working. For anyone else
who needs to do this: call setTransactional(true) on the environment and on each
database, and setLocking(true) on the environment. Provide a transaction for the
cursors and commit it after the cursor is closed. I'm sure this degrades
performance somewhat, but it works for me. If you can think of side effects I may
not have anticipated, please let me know.

Original comment by EvanDavi...@gmail.com on 8 Oct 2010 at 4:10
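For readers who want to follow the approach described in the comment above, here is a minimal Berkeley DB JE sketch, not crawler4j's actual code: the environment directory "frontier", the database name "DocIDs", and the empty cursor body are placeholders.

```java
import java.io.File;

import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.Transaction;

public class TransactionalEnvSketch {
    public static void main(String[] args) {
        // Open the environment with transactions and locking enabled,
        // so committed data survives an unexpected shutdown.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        envConfig.setLocking(true);
        Environment env = new Environment(new File("frontier"), envConfig);

        // Open each database (e.g. the docID store) transactionally as well.
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        Database docIdsDb = env.openDatabase(null, "DocIDs", dbConfig);

        // Give cursors an explicit transaction and commit it after the
        // cursor is closed, as the comment above describes.
        Transaction txn = env.beginTransaction(null, null);
        Cursor cursor = docIdsDb.openCursor(txn, null);
        try {
            // ... iterate / read / write through the cursor here ...
        } finally {
            cursor.close();
            txn.commit();
        }

        docIdsDb.close();
        env.close();
    }
}
```

As the commenter notes, transactional writes cost some throughput; the trade-off is that the docID and frontier databases remain usable after an unexpected termination.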

GoogleCodeExporter commented 9 years ago
I need this too; it's very helpful. Thank you!

Original comment by wanxiang.xing@gmail.com on 24 Mar 2011 at 7:46

GoogleCodeExporter commented 9 years ago
How do you do this in code? I understand what you did, but I couldn't implement it.

Original comment by utkuson...@yahoo.com on 21 May 2011 at 11:10

GoogleCodeExporter commented 9 years ago
If you check out the latest version from SVN, it supports resuming. You just need
to enable it in the crawler4j.properties file.

-Yasser

Original comment by ganjisaffar@gmail.com on 21 May 2011 at 4:08

GoogleCodeExporter commented 9 years ago
As of version 3.0, crawler4j supports resuming. See 
http://code.google.com/p/crawler4j/wiki/Configurations#Resumable_Crawling for 
more details.

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 3:56
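The wiki page linked above describes the version 3.0 API. A minimal sketch of a resumable crawl setup is shown below; MyCrawler stands in for your own WebCrawler subclass (not shown here), and the storage folder, seed URL, and crawler count are placeholders.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ResumableCrawlSketch {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        // Keep the crawl folder between runs; the frontier and docID
        // databases stored there are what make resuming possible.
        config.setCrawlStorageFolder("/data/crawl/root");
        // Reuse the data of a previously interrupted crawl instead of
        // starting from scratch.
        config.setResumableCrawling(true);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 7);
    }
}
```

With resumable crawling enabled, restarting the program with the same storage folder picks up the existing work queues rather than revisiting every page, which addresses the scenario in the original report.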