Can you please give me a slightly more detailed scenario?
Are you using BasicCrawler & BasicCrawlerController?
What do you code after the final line in the controller,
"controller.start(BasicCrawler.class, numberOfCrawlers);"?
Original comment by avrah...@gmail.com
on 25 Aug 2014 at 12:47
We found two issues:
The first is that the controller.shutdown() method does not close the page
fetcher; we had to run controller.getPageFetcher().shutDown(); separately.
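For illustration, a minimal sketch of the sequence we ended up with (the controller setup is as in the sample further below):

    // Sketch of the workaround: controller.shutdown() alone leaves the
    // fetcher's threads alive, so the PageFetcher must be closed separately.
    controller.start(BasicCrawler.class, numberOfCrawlers); // blocks until done
    controller.shutdown();                   // stops the controller itself...
    controller.getPageFetcher().shutDown();  // ...but the fetcher needs this call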
The second issue is that new threads are created on every iteration when a new
crawl-storage path is provided each time. For example, given the code:
int i = 1;
while (true) {
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("crawlerStorge" + i);
    config.setPolitenessDelay(1000);
    config.setMaxDepthOfCrawling(-1);
    config.setMaxPagesToFetch(50);
    config.setResumableCrawling(false);
    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController cc = new CrawlController(config, pageFetcher, robotstxtServer);
    cc.shutdown();
    cc.getPageFetcher().shutDown();
    System.out.println(i);
    i++;
}
The thread stacks and memory usage grow without bound, unless you change
config.setCrawlStorageFolder("crawlerStorge" + i); to
config.setCrawlStorageFolder("crawlerStorge");
Original comment by rothschi...@gmail.com
on 25 Aug 2014 at 1:47
Some of the code is missing...
cc.addSeed(...)
cc.start(...)
Original comment by avrah...@gmail.com
on 25 Aug 2014 at 2:16
I inserted the pageFetcher closing into the code (it will be available in the
coming commit).
According to your description, it seems as if the crawler should delete the
in-memory DB when shutting down - can you find that place in the code and
submit a patch?
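For reference, a rough sketch of what closing the underlying BerkeleyDB JE environment could look like (the field and method names here are hypothetical, not the actual crawler4j code):

    // Hypothetical sketch only: closing the JE Environment on shutdown
    // releases its cached, in-memory data.
    private com.sleepycat.je.Environment env; // assumed field on the controller

    private void closeEnvironment() {
        if (env != null) {
            env.cleanLog(); // let the JE log cleaner flush before closing
            env.close();
        }
    }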
Original comment by avrah...@gmail.com
on 25 Aug 2014 at 2:27
Look at these changes - maybe they will help:
https://code.google.com/r/lallianmawia-crawler4j/source/detail?r=82db8ca628af2cba729e18156e911d13511ff575
Original comment by avrah...@gmail.com
on 25 Aug 2014 at 2:32
Excellent, we were just about to point out that there was no proper shutdown
for the Environment object. The changes you've listed look good.
Thank you for the prompt response.
Original comment by rothschi...@gmail.com
on 25 Aug 2014 at 2:49
Sure, no problem.
But please come back and verify that those changes really work for you, so I
can merge them internally and they will be in the next release for the benefit
of all.
Original comment by avrah...@gmail.com
on 25 Aug 2014 at 2:51
For some reason, the change causes the RobotstxtServer object to be null
(using the code from above), resulting in the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: com/sleepycat/je/EnvironmentConfig
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:97)
    at com.tests.CrawlGenTest.testIT(CrawlGenTest.java:105)
    at com.tests.CrawlGenTest.main(CrawlGenTest.java:23)
Caused by: java.lang.ClassNotFoundException: com.sleepycat.je.EnvironmentConfig
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 3 more
Original comment by rothschi...@gmail.com
on 26 Aug 2014 at 6:31
This exception is weird, as it is usually thrown when a jar file is outdated
or missing...
Please clone the latest crawler4j and retry.
Just for reference, I found this stackoverflow question asked two years back:
http://stackoverflow.com/questions/12160206/nosuchmethoderror-in-crawler4j-crawelcontroller-class
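For illustration, a quick diagnostic sketch (not part of crawler4j) to check whether the BerkeleyDB JE jar is actually on the runtime classpath:

    // Prints whether the com.sleepycat.je classes are loadable at runtime.
    public class ClasspathCheck {
        public static void main(String[] args) {
            try {
                Class.forName("com.sleepycat.je.EnvironmentConfig");
                System.out.println("BerkeleyDB JE found on the classpath");
            } catch (ClassNotFoundException e) {
                System.out.println("BerkeleyDB JE is missing from the classpath");
            }
        }
    }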
Original comment by avrah...@gmail.com
on 26 Aug 2014 at 7:17
Thanks for your help. After a bunch of testing, we understood there are two
situations: one where the crawl controller has started and one where it has
not. We created a forceShutdown() for the cases where the controller has not
started (a rough sketch of the idea follows the link below).
We also found an issue caused by the truncate process. To resolve it while
reusing the same path names for the crawler storage, we removed all calls to
the environment's truncate-database operation.
https://code.google.com/r/yonid-crawler4j/source/detail?r=0097f6cbc915fe3e323a82a5f85d38d120e972bd
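For illustration only, a rough sketch of that idea (the field names are assumed, not the committed code; see the linked revision for the real change):

    // Hypothetical sketch: release resources even when start(...) never ran.
    public void forceShutdown() {
        pageFetcher.shutDown();  // stop the fetcher's threads
        if (env != null) {
            env.close();         // close the BerkeleyDB JE environment
        }
    }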
cheers
Original comment by yo...@evercompliant.com
on 26 Aug 2014 at 2:57
So the final changes you have are only in the CrawlController, right?
1. The env close
2. The public forceShutdown()
Original comment by avrah...@gmail.com
on 26 Aug 2014 at 3:16
Fixed at revision 2264d63b4c20.
Closed the PageFetcher at shutdown.
(The env closing was already done in one of the previous commits.)
Original comment by avrah...@gmail.com
on 1 Sep 2014 at 7:37
Original issue reported on code.google.com by
rothschi...@gmail.com
on 25 Aug 2014 at 12:13