jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Threads not being killed in graceful shutdown #296

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Create a controller.
2. Crawl a page with controller.start.
3. Wait for the result.
4. Call controller.shutdown.
5. Repeat (or run as multi-threaded runnables).
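
A minimal sketch of that loop, for illustration only (the storage path, seed URL, and crawler count are placeholders; BasicCrawler stands in for whatever WebCrawler subclass is used):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.examples.basic.BasicCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class RepeatedCrawl {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 10; i++) {
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder("/tmp/crawlStorage");   // placeholder path
                PageFetcher pageFetcher = new PageFetcher(config);
                RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

                controller.addSeed("http://www.example.com/");        // placeholder seed
                controller.start(BasicCrawler.class, 2);              // blocks until the crawl finishes
                controller.shutdown();                                // expected to release all crawler threads
            }
        }
    }

Each iteration creates a fresh PageFetcher and CrawlController; if shutdown() does not stop their threads, the thread count grows by a few threads per iteration.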

What is the expected output? What do you see instead?
Expected: the thread count remains roughly constant across iterations.
Instead: the number of threads keeps increasing, because the threads created by each controller are never terminated gracefully.

What version of the product are you using?
3.5

Please provide any additional information below.

Original issue reported on code.google.com by rothschi...@gmail.com on 25 Aug 2014 at 12:13

GoogleCodeExporter commented 8 years ago
Can you please give me a more detailed scenario?

I am using BasicCrawler & BasicCrawlerController

What code do you run after the final line in the controller?
"controller.start(BasicCrawler.class, numberOfCrawlers);"

Original comment by avrah...@gmail.com on 25 Aug 2014 at 12:47

GoogleCodeExporter commented 8 years ago
We found two issues:

The first is that the controller.shutdown() method does not close the page fetcher; we had to call controller.getPageFetcher().shutDown() separately.

The second is that threads accumulate when a new crawl-storage path is provided on every iteration. For example, given the code:
    int i = 1;
    while (true) {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("crawlerStorge" + i);
        config.setPolitenessDelay(1000);
        config.setMaxDepthOfCrawling(-1);
        config.setMaxPagesToFetch(50);
        config.setResumableCrawling(false);
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController cc = new CrawlController(config, pageFetcher, robotstxtServer);

        cc.shutdown();
        cc.getPageFetcher().shutDown();

        System.out.println(i);
        i++;
    }

The number of stacks and the memory usage grow arbitrarily large, unless you change
config.setCrawlStorageFolder("crawlerStorge" + i); to
config.setCrawlStorageFolder("crawlerStorge");

Original comment by rothschi...@gmail.com on 25 Aug 2014 at 1:47

GoogleCodeExporter commented 8 years ago
Some of the code is missing...

cc.addSeed...
cc.start...
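
For reference, the missing lines inside the while loop above would typically look something like this (the seed URL and crawler count are placeholders, not taken from the actual test code):

        cc.addSeed("http://www.example.com/");   // placeholder seed URL
        cc.start(BasicCrawler.class, 2);         // blocks until this crawl finishes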

Original comment by avrah...@gmail.com on 25 Aug 2014 at 2:16

GoogleCodeExporter commented 8 years ago
I inserted the pageFetcher closing into the code (it will be available in the coming commit).

According to your description, it seems the crawler should also delete the in-memory DB when shutting down. Can you find that place in the code and submit a patch?
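
For context, a rough sketch of the kind of cleanup being asked for; the helper and the variable names are hypothetical, not the actual crawler4j code:

    import com.sleepycat.je.Database;
    import com.sleepycat.je.Environment;

    // Hypothetical shutdown helper: close the BerkeleyDB databases and their
    // Environment so repeated crawls do not leak the JE background threads
    // and file handles.
    class CrawlEnvCleanup {
        static void close(Database docIdsDB, Database pendingUrlsDB, Environment env) {
            docIdsDB.close();        // close the databases before the environment
            pendingUrlsDB.close();
            env.close();             // stops the JE cleaner/evictor threads
        }
    }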

Original comment by avrah...@gmail.com on 25 Aug 2014 at 2:27

GoogleCodeExporter commented 8 years ago
Look at these changes - maybe they will help:
https://code.google.com/r/lallianmawia-crawler4j/source/detail?r=82db8ca628af2cba729e18156e911d13511ff575

Original comment by avrah...@gmail.com on 25 Aug 2014 at 2:32

GoogleCodeExporter commented 8 years ago
Excellent, we were just about to point out that there was no proper shutdown 
for the Environment object. The changes you've listed look good.

Thank you for the prompt response.

Original comment by rothschi...@gmail.com on 25 Aug 2014 at 2:49

GoogleCodeExporter commented 8 years ago
Sure, no problem.

But please come back and confirm whether those changes really work for you, so I can merge them and have them included in the next release for the benefit of all.

Original comment by avrah...@gmail.com on 25 Aug 2014 at 2:51

GoogleCodeExporter commented 8 years ago
For some reason, after that change, the RobotstxtServer object is null (using the code from above), and the following exception is thrown:
Exception in thread "main" java.lang.NoClassDefFoundError: com/sleepycat/je/EnvironmentConfig
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:97)
    at com.tests.CrawlGenTest.testIT(CrawlGenTest.java:105)
    at com.tests.CrawlGenTest.main(CrawlGenTest.java:23)
Caused by: java.lang.ClassNotFoundException: com.sleepycat.je.EnvironmentConfig
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 3 more

Original comment by rothschi...@gmail.com on 26 Aug 2014 at 6:31

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
This exception is odd, as it is usually thrown when a jar file is outdated or missing...

Please clone the latest version of crawler4j and retry.

Just for reference, I found this stackoverflow question asked 2 years back:
http://stackoverflow.com/questions/12160206/nosuchmethoderror-in-crawler4j-crawelcontroller-class
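
A quick way to confirm whether the BerkeleyDB JE jar is actually on the runtime classpath (a standalone diagnostic sketch, not part of crawler4j):

    public class ClasspathCheck {
        public static void main(String[] args) {
            try {
                // If this throws, the BerkeleyDB JE jar is missing, which would
                // explain the NoClassDefFoundError above.
                Class.forName("com.sleepycat.je.EnvironmentConfig");
                System.out.println("BerkeleyDB JE found on the classpath.");
            } catch (ClassNotFoundException e) {
                System.out.println("BerkeleyDB JE is missing from the classpath.");
            }
        }
    }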

Original comment by avrah...@gmail.com on 26 Aug 2014 at 7:17

GoogleCodeExporter commented 8 years ago
Thanks for your help. After a bunch of testing we understood that there are two situations: one where the crawl controller has been started, and one where it has not. We created a forceShutdown() for the cases where the controller has not been started (a rough sketch follows the link below).

We also found an issue caused by the truncate process.

To resolve it while reusing the same path names for the crawler storage, we removed all calls to the environment's database truncation.

https://code.google.com/r/yonid-crawler4j/source/detail?r=0097f6cbc915fe3e323a82a5f85d38d120e972bd
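
The forceShutdown mentioned above might look roughly like this; the field names (shuttingDown, pageFetcher, env) are assumptions about the CrawlController internals, not the committed code:

    // Rough sketch: release resources even when start() was never called, so
    // repeatedly creating controllers does not leak threads or Environment handles.
    public void forceShutdown() {
        shuttingDown = true;        // ask any running crawler threads to stop
        pageFetcher.shutDown();     // close the HTTP connection manager
        if (env != null) {
            env.close();            // close the BerkeleyDB Environment without truncating it
        }
    }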

cheers

Original comment by yo...@evercompliant.com on 26 Aug 2014 at 2:57

GoogleCodeExporter commented 8 years ago
So the final changes you have are only in the CrawlController, right?
1. The env close
2. The public forceShutdown()

Original comment by avrah...@gmail.com on 26 Aug 2014 at 3:16

GoogleCodeExporter commented 8 years ago
Fixed at revision 2264d63b4c20.

Closed the PageFetcher at shutdown.
(The env closing was already done in one of the previous commits.)

Original comment by avrah...@gmail.com on 1 Sep 2014 at 7:37