This is a good-to-have feature. In one of the older versions, I added some code
to trap Ctrl+C and then displayed a message to press Enter. If Enter was pressed, the
program terminated immediately, without waiting for the threads to consume their
work queues after finishing the current page. Otherwise, all the
threads crawled their work queues to completion and terminated without fetching a new batch.
I also added some other options, such as limiting the number of pages to fetch. Once
the limit is reached, the Frontier stops distributing new URLs to the threads.
Eventually, all of them become idle and the crawler terminates.
Original comment by ggec2...@gmail.com
on 28 Nov 2010 at 12:00
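A rough sketch of how such a Ctrl+C trap could be wired up with a JVM shutdown hook is shown below; the CrawlerController type and its requestGracefulStop() method are placeholders for illustration, not actual crawler4j API:

    public class ShutdownExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical controller with a cooperative stop flag.
            final CrawlerController controller = new CrawlerController();

            // Runs when the JVM receives Ctrl+C (SIGINT) or a normal shutdown request.
            Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
                public void run() {
                    System.out.println("Ctrl+C received, asking crawler threads to drain their work queues...");
                    controller.requestGracefulStop();
                }
            }));

            controller.start();
        }
    }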
Thanks for your comments.
However, think of the situation where you either don't have access to the
production machine running the crawler, or don't want to log in to the physical
machine and issue a Ctrl+C.
Original comment by pkal...@gmail.com
on 29 Nov 2010 at 5:38
The crawler does not release its resources when it finishes. It keeps the
frontier file locked.
Original comment by mishra....@gmail.com
on 16 Dec 2010 at 9:16
Yesterday I had the same issue. Today I fixed it with this trick. It is not
immediate, but it is enough for me.
    public boolean shouldVisit(WebURL url)
    {
        if (crawlerMustStop())
        {
            return false;
        }
        ...
Original comment by DLopezGo...@gmail.com
on 15 Jun 2011 at 9:38
This is not really a fix for stopping the crawler.
shouldVisit() is called for every link on a page that has just been
downloaded/parsed, so returning false here merely stops new URLs from being
added to the pending-URLs list; the crawler still has to go through the
existing pending-URL list, so it is not really stopping.
Original comment by pkal...@gmail.com
on 15 Jun 2011 at 11:58
I know that, but without modifying the code it was the only way I saw.
The modifications I made to stop the crawler are these.
In run() of the WebCrawler class:

    if (!isFinish()) {
        processPage(curURL);
    }

In WebCrawler:

    private boolean finish = false;

    public boolean isFinish()
    {
        return finish;
    }

    protected void setFinish()
    {
        finish = true;
    }
In start() of CrawlController:

    if (!someoneIsWorking)
    {
        // BEGIN CHANGE
        Iterator<WebCrawler> it = (Iterator<WebCrawler>) crawlers.iterator();
        boolean bAllFinish = true;
        while (it.hasNext())
        {
            WebCrawler wb = it.next();
            bAllFinish = bAllFinish && wb.isFinish();
        }
        if (bAllFinish)
        {
            logger.info("All of the crawlers are finished by some condition. Finishing the process...");
            for (T crawler : crawlers) {
                crawler.onBeforeExit();
                crawlersLocalData.add(crawler.getMyLocalData());
            }
            // At this step, frontier notifies the threads that were waiting for new URLs and they should stop
            // We will wait a few seconds for them and then return.
            Frontier.finish();
            logger.info("Waiting for 10 seconds before final clean up...");
            sleep(10);
            Frontier.close();
            // NOTE: PageFetcher.stopConnectionMonitorThread();
            return;
        }
        // END CHANGE

        // Make sure again that none of the threads are alive.
        logger.info("It looks like no thread is working, waiting for 40 seconds to make sure...");
        sleep(40);
Note:
For my convenience, I needed to take the PageFetcher out of the
CrawlController, because I need to launch the crawler many times with different
seeds. I use crawler4j to search for some links, and I need to stop and relaunch
the crawler. For the moment I have no problems with this.
PageFetcher.stopConnectionMonitorThread() and
PageFetcher.startConnectionMonitorThread() are now called outside of CrawlController.
To stop the crawler, I use the new finish flag (setFinish()/isFinish()) added to the WebCrawler class.
If I can help with the project, I am here.
Original comment by DLopezGo...@gmail.com
on 16 Jun 2011 at 10:30
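Putting the pieces of this patch together, raising the finish flag from outside might look roughly like the sketch below; the stopAllCrawlers() helper and the crawlers list are assumptions based on the snippets above, not part of the stock CrawlController:

    // Hypothetical helper added to CrawlController, based on the patch above:
    // raising the flag makes each crawler's run() loop skip processPage(), and
    // the all-finished check in start() then shuts everything down.
    public void stopAllCrawlers() {
        for (WebCrawler crawler : crawlers) {
            crawler.setFinish();
        }
    }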
As of version 3.0, graceful stop is supported. See
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/shutdown/
for a sample.
-Yasser
Original comment by ganjisaffar@gmail.com
on 2 Jan 2012 at 3:55
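For reference, usage of the graceful stop in the linked sample looks roughly like the sketch below (method names as recalled from the 3.x API; BasicCrawler stands in for your own WebCrawler subclass, and the storage folder is a placeholder):

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawler4j/");   // placeholder path

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("http://www.ics.uci.edu/");

    // Start the crawl without blocking the current thread.
    controller.startNonBlocking(BasicCrawler.class, 5);

    // ... later, when the crawl should stop:
    controller.shutdown();          // ask all crawler threads to stop gracefully
    controller.waitUntilFinish();   // block until they have actually finished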
Hi, I would like to check whether the issue of Tomcat with the frontier of crawler4j
has been resolved. I am using version 3.3, and I have realized that despite setting
resumable crawling to false, the crawling still seems to be resumable,
i.e. I cannot start a new crawl with the same seed. I have been trying all ways
to tackle the issue for some time but still could not get it to work.
Original comment by kohl0...@gmail.com
on 27 Apr 2013 at 1:46
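For reference, resumable crawling is controlled through CrawlConfig; a truly fresh crawl usually also needs a clean (or new) crawl storage folder, since the frontier database lives there. A minimal sketch with a placeholder path:

    CrawlConfig config = new CrawlConfig();
    config.setResumableCrawling(false);               // do not reuse the old frontier
    config.setCrawlStorageFolder("/tmp/crawl-run-1"); // placeholder; use a fresh folder per run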
Original issue reported on code.google.com by
pkal...@gmail.com
on 6 Aug 2010 at 7:15