sawantuday / crawler4j

Automatically exported from code.google.com/p/crawler4j

Cannot delete frontier temp folder #157

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi, I am trying to perform schedule-based crawling of the same page.
The first time it crawls fine, but after the crawl there are files left in the
frontier folder (.jdb). When I start another crawl (creating a new crawler
instance), it says "deleting frontier folder" but actually does nothing, and
crawls nothing. If I change the frontier folder for the second crawl,
everything works.

Q: What am I doing wrong? How can I delete those files after the crawl?

PS: I don't use resumable crawling.

Thank you.

Original issue reported on code.google.com by plotniko...@gmail.com on 1 Jun 2012 at 9:13
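
For context, this is roughly the setup being described: a hedged sketch against the crawler4j 3.x API, where MyCrawler is a hypothetical WebCrawler subclass and the storage path is illustrative. Each scheduled run constructs a fresh CrawlController over the same storage folder:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ScheduledCrawl {
    public static void runOnce() throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // frontier folder is created under here
        config.setResumableCrawling(false);          // as in the report: not resumable

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 1);        // blocks until this crawl finishes

        // On the next scheduled call, a new controller reuses the same folder;
        // the leftover .jdb files from this run are what break that second crawl.
    }
}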

GoogleCodeExporter commented 9 years ago
Any updates or good workarounds for this issue? I'm experiencing the same
problem.

Original comment by gmei...@gmail.com on 7 Jun 2012 at 3:58

GoogleCodeExporter commented 9 years ago
I've modified the sources:
CrawlerController.java:
....
    protected final Object waitingLock = new Object();

    //plotn
    public Environment env; 
    //\plotn

    public CrawlController(CrawlConfig config, PageFetcher pageFetcher, RobotstxtServer robotstxtServer)
            throws Exception {
        super(config);
....
        if (!resumable) {
            IO.deleteFolderContents(envHome);
        }
        //plotn
        //Environment env = new Environment(envHome, envConfig);
        env = new Environment(envHome, envConfig);
        //\plotn

        docIdServer = new DocIDServer(env, config);
....
                                        frontier.close();
                                        docIdServer.close();
                                        pageFetcher.shutDown();

                                        finished = true;
                                        waitingLock.notifyAll();

                                        //plotn
                                        env.close();
                                        //\plotn

                                        return;
....
Counters.java:
....
public void close() {
        try {
            if (statisticsDB != null) {
                statisticsDB.close();
                //plotn
                try {
                    System.out.println("Database Statistics deteting...");
                    env.truncateDatabase(null, "Statistics",false);
                    env.removeDatabase(null, "Statistics");
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("Database deteting error: "+e.getClass()+" "+(e.getMessage()==null?"":e.getMessage()));
                }
                //\plotn
            }   
        } catch (DatabaseException e) {
            e.printStackTrace();
        } 
....
DocIDServer.java:
....
    protected int lastDocID;

    //plotn
    protected Environment env;
    //\plotn

    public DocIDServer(Environment env, CrawlConfig config) throws DatabaseException {
        super(config);
        //plotn
        this.env = env;
        //\plotn
        DatabaseConfig dbConfig = new DatabaseConfig(); 
....
public void close() {
        try {
            docIDsDB.close();
            //plotn
            try {
                System.out.println("Database DocIDs deteting...");
                env.truncateDatabase(null, "DocIDs", false);
                env.removeDatabase(null, "DocIDs");  
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Database deteting error: "+e.getClass()+" "+(e.getMessage()==null?"":e.getMessage()));
            }
            //\plotn
        } catch (DatabaseException e) {
            e.printStackTrace();
        }
    } 
....
WorkQueues.java:
....
    protected final Object mutex = new Object();

    //plotn
    protected String dbName;
    //\plotn

    public WorkQueues(Environment env, String dbName, boolean resumable) throws DatabaseException {
        this.env = env;
        this.resumable = resumable;
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(resumable);
        dbConfig.setDeferredWrite(!resumable);
        //plotn
        this.dbName = dbName;
        //\plotn
        urlsDB = env.openDatabase(null, dbName, dbConfig);
....
public void close() {
        try {
            urlsDB.close();
            //plotn
            try {
                System.out.println("Database "+dbName+" deteting...");
                env.truncateDatabase(null, dbName, false);
                env.removeDatabase(null, dbName);  
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Database deteting error: "+e.getClass()+" "+(e.getMessage()==null?"":e.getMessage()));
            }
            //\plotn
        } catch (DatabaseException e) {
            e.printStackTrace();
        }
    } 
....

Original comment by plotniko...@gmail.com on 7 Jun 2012 at 6:25

GoogleCodeExporter commented 9 years ago
After the crawl I now do:
controller.env.close();

then delete the folder contents. I don't know whether this is correct or not,
but it works. Feel free to write to plotnikovya@gmail.com.

Updated jar file attached.

Original comment by plotniko...@gmail.com on 7 Jun 2012 at 6:28
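
In other words, a minimal sketch of the workaround: this assumes the patched build above, where env is a public field on the controller, and assumes the environment lives in the "frontier" subfolder of the crawl storage folder.

import java.io.File;

// After controller.start(...) returns:
controller.env.close();  // release the Berkeley DB JE handles on the .jdb files

// Only now can the files actually be removed:
File frontierDir = new File(config.getCrawlStorageFolder(), "frontier");
File[] leftovers = frontierDir.listFiles();
if (leftovers != null) {
    for (File f : leftovers) {
        f.delete();
    }
}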

GoogleCodeExporter commented 9 years ago
This is really a big problem. It's actually useless if I can crawl only once.

Original comment by lallian....@utsavfashion.com on 30 Jan 2014 at 8:11

GoogleCodeExporter commented 9 years ago
I have changed the source code as posted by plotnikovya@gmail.com and I think
it's working. If there is any bug with this change, please leave a comment here.

Original comment by lallian....@utsavfashion.com on 31 Jan 2014 at 7:19

GoogleCodeExporter commented 9 years ago
Also, I suggest that this change be merged into the official source code.

Original comment by lallian....@utsavfashion.com on 31 Jan 2014 at 7:22

GoogleCodeExporter commented 9 years ago
I'm also facing the same problem. I downloaded crawler4j 3.5 and am still
seeing it.

Original comment by kalia.ro...@gmail.com on 8 Aug 2014 at 5:10

GoogleCodeExporter commented 9 years ago
Now closing the environment as it should be closed.

Fix done on revision hash: 9efaeef20c30

This will be included in the next release (v3.6).

Original comment by avrah...@gmail.com on 11 Aug 2014 at 2:03
