sageone / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawl Never Starts Final Cleanup #122

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I run a depth-0 crawl on a small set of URLs (see the configuration sketch below).
2. After the crawler detects that it is finished, I receive:
 INFO [Thread-7] It looks like no thread is working, waiting for 10 seconds to make sure...
 INFO [Thread-7] No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
 INFO [Thread-7] All of the crawlers are stopped. Finishing the process...
 INFO [Thread-7] Waiting for 10 seconds before final clean up...

The crawler then never moves on to the final cleanup stage. My system activity
monitor still shows all of the crawler's threads as active.

What is the expected output? What do you see instead?
The crawler should move into the final cleanup stage and complete.

What version of the product are you using?
3.1

Please provide any additional information below.
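For reference, my setup is roughly the following (a simplified sketch against the crawler4j 3.x API; the storage path, seed URLs, and thread count are placeholders, and MyCrawler is my WebCrawler subclass, written along the lines of the crawler4j examples):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/data/crawl/root"); // placeholder path
            config.setMaxDepthOfCrawling(0);                  // depth 0: fetch only the seeds

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("http://www.example.com/"); // placeholder seed URLs
            controller.addSeed("http://www.example.org/");

            // Blocks until the crawl and the final cleanup are finished.
            controller.start(MyCrawler.class, 2);
        }
    }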

Original issue reported on code.google.com by chrstah...@gmail.com on 14 Feb 2012 at 4:05

GoogleCodeExporter commented 9 years ago
I see now that it does end: if you add a print statement after
controller.start(MyCrawler.class, numberOfCrawlers);, it will fire after the 10
seconds. Maybe adding a log line saying "Final clean up complete" might help
make this clearer?
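In other words, controller.start(...) is a blocking call, so anything placed after it only runs once the final cleanup has finished. For example (sketch; controller and numberOfCrawlers are the variables from my code, and the message text is just a suggestion):

    // start() blocks until all crawler threads have stopped and the
    // "Waiting for 10 seconds before final clean up..." step has finished,
    // so the print below only fires once the crawl is truly done.
    controller.start(MyCrawler.class, numberOfCrawlers);
    System.out.println("Final clean up complete");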

Original comment by chrstah...@gmail.com on 6 Mar 2012 at 3:51

GoogleCodeExporter commented 9 years ago
Hi, good afternoon! My name is Edwaldo, I am Brazilian, and I am starting a
project in which I will use Crawler4j. I studied all of your documentation and
implemented the code; however, I cannot get the crawl to collect anything. It
always prints the following output:

Deleting content of: 
D:\eclipse\EclipsePortableJava\Data\workspace\WebCrawler\intermediario\frontier
 INFO [main] Crawler 1 started.
 INFO [main] Crawler 2 started.
 INFO [main] Crawler 3 started.
 INFO [main] Crawler 4 started.
 INFO [main] Crawler 5 started.
 INFO [main] Crawler 6 started.
 INFO [main] Crawler 7 started.
 INFO [main] Crawler 8 started.
 INFO [main] Crawler 9 started.
 INFO [main] Crawler 10 started.
Docid: 1
URL: http://www.submarino.com.br/
Domain: 'submarino.com.br'
Sub-domain: 'www'
Path: '/'
Parent page: null
Anchor text: null
Text length: 43621
Html length: 235817
Number of outgoing links: 613
Response headers:
    X-Powered-By: Servlet/2.5 JSP/2.1
    X-Powered-By: JSF/1.2
    Content-Encoding: gzip
    Content-Type: text/html; charset=UTF-8
    Expires: Fri, 09 May 2014 19:49:38 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Fri, 09 May 2014 19:49:38 GMT
    Content-Length: 33259
    Connection: keep-alive
    Vary: Accept-Encoding
    Set-Cookie: acomChannel=INTERNET; path=/; domain=submarino.com.br
    Set-Cookie: b2wChannel=INTERNET; path=/; domain=submarino.com.br
    Set-Cookie: akaau=1399665278~id=3010416469baa56f7d459fb7d3d19525; path=/
=============
 INFO [Thread-1] It looks like no thread is working, waiting for 10 seconds to make sure...
 INFO [Thread-1] No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
 INFO [Thread-1] All of the crawlers are stopped. Finishing the process...
 INFO [Thread-1] Waiting for 10 seconds before final clean up...

As you can see, Crawler4j does run, and you can verify the number of links
found from the last seed; however, it then stops with messages as if the
threads were no longer working.

Could anyone help me?
It is urgently needed.
Thank you!
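For reference, my visit() override follows the crawler4j sample crawler and produces the per-page output shown above, roughly like this (simplified sketch; getter names are from the crawler4j 3.x API):

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        @Override
        public void visit(Page page) {
            // Print basic URL information for every fetched page.
            WebURL url = page.getWebURL();
            System.out.println("Docid: " + url.getDocid());
            System.out.println("URL: " + url.getURL());
            System.out.println("Domain: '" + url.getDomain() + "'");
            System.out.println("Sub-domain: '" + url.getSubDomain() + "'");
            System.out.println("Path: '" + url.getPath() + "'");

            // Print parse statistics when the page is HTML.
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlData = (HtmlParseData) page.getParseData();
                System.out.println("Text length: " + htmlData.getText().length());
                System.out.println("Html length: " + htmlData.getHtml().length());
                System.out.println("Number of outgoing links: "
                        + htmlData.getOutgoingUrls().size());
            }
            System.out.println("=============");
        }
    }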

Original comment by edwaldos...@gmail.com on 10 May 2014 at 7:40

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:10