mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Not possible to run a crawl more than once. #181

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a CrawlController and start it.
2. Let it finish.
3. Create another CrawlController just the same as the first and start it.

What is the expected output? What do you see instead?
I expect the same crawl to be performed. Instead, it seems that Berkeley DB 
keeps the database in memory, so even though the files in the frontier folder 
are deleted, the 'DocIDs' database is not empty. As a result the crawl does not 
process any pages, because the seed URLs are already in the DocIDs database.
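
To illustrate the failure mode described above, here is a minimal sketch in plain Java that uses neither crawler4j nor Berkeley DB. The names `SeenUrlStore` and `RepeatedCrawlSketch` are hypothetical stand-ins: the store plays the role of the 'DocIDs' database and simply remembers every URL it has been offered. It only models the reported behaviour and makes no claim about how the library is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the 'DocIDs' database: remembers every URL seen so far.
class SeenUrlStore {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a given URL is offered.
    boolean addIfNew(String url) {
        return seen.add(url);
    }

    // Models wiping the store before a non-resumable crawl.
    void clear() {
        seen.clear();
    }
}

class RepeatedCrawlSketch {
    // Returns the seeds that would actually be scheduled for crawling.
    static List<String> scheduleSeeds(SeenUrlStore store, List<String> seeds) {
        List<String> scheduled = new ArrayList<>();
        for (String seed : seeds) {
            if (store.addIfNew(seed)) {
                scheduled.add(seed);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        SeenUrlStore store = new SeenUrlStore();
        List<String> seeds = Arrays.asList("http://example.com/");

        // First crawl: the seed is new, so it is scheduled.
        System.out.println("first run:   " + scheduleSeeds(store, seeds));
        // Second crawl with the surviving store: the seed is skipped.
        System.out.println("second run:  " + scheduleSeeds(store, seeds));
        // After clearing the store the seed is scheduled again.
        store.clear();
        System.out.println("after clear: " + scheduleSeeds(store, seeds));
    }
}
```

In this model the second run schedules nothing because the store outlived the first run, which matches the symptom reported here. Clearing the store between runs restores the expected behaviour.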

What version of the product are you using?
3.3

Please provide any additional information below.
I solved it in my clone from GitHub like this:
In the CrawlController constructor I added three lines of code after the 
Environment is created:
        if(!resumable) {
            deleteExistingDatabases(env);
        }

The deleteExistingDatabases-method looks like this:

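    // Removes every database in the given environment so that a
    // non-resumable crawl starts from an empty state.
    private void deleteExistingDatabases(Environment env) {
        try {
            List<String> databases = env.getDatabaseNames();
            for (String database : databases) {
                logger.debug("removing database '" + database + "'");
                // null transaction: the removal is not part of an explicit transaction
                env.removeDatabase(null, database);
            }
        } catch (DatabaseException e) {
            // Best effort: log the failure and continue.
            logger.warn("Error while deleting databases, ignoring...", e);
        }
    }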

Original issue reported on code.google.com by jorgen.l...@gmail.com on 27 Nov 2012 at 10:43


GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:30

GoogleCodeExporter commented 9 years ago
I'm glad to see that there's activity in this project again.

I'm attaching the final version of my clone of the CrawlController. I thought 
that there was too much going on in the constructor, so I refactored most of it 
out into a public "configure" method that does most of the work the constructor 
used to do. This also makes it possible to override the configure method to 
provide your own configuration.

The configure method is split up into smaller protected methods, in case you 
only want to override part of the configuration.

Also, it throws RuntimeException instead of Exception.
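
A rough sketch of the shape of that refactoring, for readers without the attachment. This is not the attached file: the class names (`ConfigurableController`, `InMemoryFrontierController`) and the step methods (`openEnvironment`, `prepareFrontier`) are hypothetical and chosen only to illustrate the idea of a public configure method built from overridable protected steps, with checked exceptions wrapped in a RuntimeException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical controller illustrating the pattern; not the attached CrawlController.
class ConfigurableController {
    // Records which configuration steps ran, so the effect of overriding is visible.
    protected final List<String> steps = new ArrayList<>();

    // Public entry point: runs every configuration step in order.
    public void configure() {
        try {
            openEnvironment();
            prepareFrontier();
        } catch (Exception e) {
            // Wrap checked exceptions so callers do not need a throws clause.
            throw new RuntimeException("configuration failed", e);
        }
    }

    // Each step is protected so a subclass can replace just that part.
    protected void openEnvironment() throws Exception {
        steps.add("environment");
    }

    protected void prepareFrontier() throws Exception {
        steps.add("frontier");
    }

    public List<String> getSteps() {
        return steps;
    }
}

// Overrides a single step and inherits the rest of the configuration.
class InMemoryFrontierController extends ConfigurableController {
    @Override
    protected void prepareFrontier() {
        steps.add("in-memory frontier");
    }
}
```

A caller would create the controller and then call `configure()` explicitly. Keeping that call out of the constructor avoids invoking overridable methods before the subclass is fully initialised, which is one reason to prefer a separate configure step in a design like this.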

Anyway, hope it will be useful. The file is attached to this comment.

Cheers,

/Jörgen

Original comment by jorgen.l...@gmail.com on 19 Aug 2014 at 7:44

Attachments:

GoogleCodeExporter commented 9 years ago
Thank you.

We have a lot on our plate, but we will get to this one eventually as well.

Original comment by avrah...@gmail.com on 19 Aug 2014 at 7:47