mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Graceful stop/abort - good to have #11

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,
Since Tomcat is shared, it can't be stopped just to stop a running crawler.
It would be good to have the ability to 'trigger' a stop/abort of the running
crawler, which would then stop once all the running threads are done.

What steps will reproduce the problem?
-NA

What is the expected output? What do you see instead?
-NA

What version of the product are you using? On what operating system?
-1.8

Please provide any additional information below.

Thanks.

Original issue reported on code.google.com by pkal...@gmail.com on 6 Aug 2010 at 7:15

GoogleCodeExporter commented 9 years ago
This is a good-to-have feature. In one of the older versions, I added some code
to trap Ctrl+C. It then displayed a message asking the user to press Enter. If Enter
was pressed, the program terminated immediately, without waiting for the threads to
consume their work queues after finishing the current page. Otherwise, all the
threads crawled their work queues and terminated without fetching a new batch.
I also added some other options, like limiting the number of pages to fetch. Once
the limit is reached, the Frontier stops distributing new URLs to the threads.
Eventually, all of them become idle and the crawler terminates.
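
For illustration only, a minimal self-contained sketch of the Ctrl+C trap described above, assuming a volatile stop flag that the crawl loop checks before taking new work (the class, flag, and loop here are hypothetical stand-ins, not crawler4j code):

public class GracefulStopSketch {

    // Flag checked by the crawl loop; volatile so all threads see the update.
    private static volatile boolean stopRequested = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // Stand-in for a crawler thread: keep processing batches until asked to stop.
            while (!stopRequested) {
                System.out.println("Processing current work queue...");
                try {
                    Thread.sleep(1000); // simulate crawling one batch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
            System.out.println("Work queue drained, worker exiting.");
        });
        worker.start();

        // On Ctrl+C the hook sets the flag and waits for the worker to drain its
        // queue; the JVM only halts after all shutdown hooks have returned.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            stopRequested = true;
            try {
                worker.join();
            } catch (InterruptedException ignored) {
            }
        }));

        worker.join();
    }
}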

Original comment by ggec2...@gmail.com on 28 Nov 2010 at 12:00

GoogleCodeExporter commented 9 years ago
Thanks for your comments.
However, think of the situation where you either don't have access to the
production machine running the crawler, or don't want to log into the physical
machine and issue a Ctrl+C.

Original comment by pkal...@gmail.com on 29 Nov 2010 at 5:38

GoogleCodeExporter commented 9 years ago
The crawler is not releasing its resources when it finishes. It keeps the
frontier files locked.

Original comment by mishra....@gmail.com on 16 Dec 2010 at 9:16

GoogleCodeExporter commented 9 years ago
Yesterday I had the same issue. Today I worked around it with this trick. It is
not immediate, but it is enough for me.

public boolean shouldVisit(WebURL url)
{
    if(crawlerMustStop())
    {
       return false;
    }
    ...

Original comment by DLopezGo...@gmail.com on 15 Jun 2011 at 9:38

GoogleCodeExporter commented 9 years ago
This is not really a fix for stopping the crawler.

The "shouldVisit()" is called for every link on a page just
downloaded/parsed, so returning false here will just not add to the
pending-urls list, but crawler still has to go thru existing pending-url
list, thus is not really stopping.

Original comment by pkal...@gmail.com on 15 Jun 2011 at 11:58

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I know that, but without modifying the code it was the only way I could see.

These are the modifications I made to stop the crawler.

In run() of the WebCrawler class:

if (!isFinish()) {
    processPage(curURL);
}

In WebCrawler:

private boolean finish = false;

public boolean isFinish()
{
    return finish;
}

protected void setFinish()
{
    finish = true;
}

In start() of CrawlController:

if (!someoneIsWorking)
{
    // BEGIN CHANGE
    Iterator<WebCrawler> it = (Iterator<WebCrawler>) crawlers.iterator();
    boolean bAllFinish = true;
    while (it.hasNext())
    {
        WebCrawler wb = it.next();
        bAllFinish = bAllFinish && wb.isFinish();
    }

    if (bAllFinish)
    {
        logger.info("All of the crawlers are finished by some condition. Finishing the process...");
        for (T crawler : crawlers) {
            crawler.onBeforeExit();
            crawlersLocalData.add(crawler.getMyLocalData());
        }

        // At this step, frontier notifies the threads that were waiting for new URLs and they should stop
        // We will wait a few seconds for them and then return.
        Frontier.finish();
        logger.info("Waiting for 10 seconds before final clean up...");
        sleep(10);

        Frontier.close();
        // NOTE: PageFetcher.stopConnectionMonitorThread();
        return;
    }
    // END CHANGE

    // Make sure again that none of the threads are alive.
    logger.info("It looks like no thread is working, waiting for 40 seconds to make sure...");
    sleep(40);

Note:

For my convenience, I needed to take the PageFetcher out of the CrawlController,
because I need to launch the crawler many times with different seeds. I use
crawler4j to search for some links, and I need to stop and relaunch the crawler.
For the moment I have had no problems with this.

PageFetcher.stopConnectionMonitorThread() and
PageFetcher.startConnectionMonitorThread() are now called outside of CrawlController.

To stop the crawler I call the new setFinish() method of the WebCrawler class.
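
Purely as an illustration of how such a stop could be triggered remotely (the original motivation of this issue), here is a self-contained sketch using the JDK's built-in HttpServer. The /stop endpoint and stopRequested flag are hypothetical; with the modifications above, the handler would instead call setFinish() on each WebCrawler instance:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class StopTriggerSketch {

    // Flag the crawl loop would check; volatile so all threads see the update.
    static volatile boolean stopRequested = false;

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/stop", exchange -> {
            stopRequested = true; // or: loop over the crawlers and call setFinish()
            byte[] body = "stop requested\n".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Send GET /stop on port 8081 to request a graceful stop.");
    }
}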

If I can help in the project, I am here.

Original comment by DLopezGo...@gmail.com on 16 Jun 2011 at 10:30

GoogleCodeExporter commented 9 years ago
As of version 3.0, graceful stop is supported. See
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/shutdown/
for a sample.
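
For reference, a rough sketch of what that shutdown sample does, assuming the 3.x API (the startNonBlocking(), shutdown() and waitUntilFinish() calls should be checked against the linked example; BasicCrawler stands in for your own WebCrawler subclass):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ShutdownSketch {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j/");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.ics.uci.edu/");

        // Start the crawl without blocking the current thread.
        controller.startNonBlocking(BasicCrawler.class, 5);

        // ... later, on whatever external trigger you use:
        Thread.sleep(30 * 1000);

        // Request a graceful stop and wait for the crawler threads to finish.
        controller.shutdown();
        controller.waitUntilFinish();
    }
}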

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 3:55

GoogleCodeExporter commented 9 years ago
Hi, I would like to check whether the issue of Tomcat with the crawler4j frontier
has been resolved. I am using version 3.3, and I realized that despite setting
resumable crawling to false, the crawl still seems to be a resumable one,
i.e. I cannot start a new crawl with the same seed. I have been trying all ways
to tackle the issue for some time but still could not get it to work.

Original comment by kohl0...@gmail.com on 27 Apr 2013 at 1:46