mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

crawler4j suddenly freezes (i.e. no more info, no more crawling) #176

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

Hard to say; I only encountered it recently (before that, I never had problems).

The situation: I am crawling a bulk of 'profile pages' (i.e. similar pages with 
similar data) on a fairly standard site. I do this in the standard way: add 
seeds, be polite, use just one crawler, no HTTPS involved, etc. Nothing complex 
(no exotic shouldVisit rules, no deep domain crawling, nothing else): just 'go 
to each page given via the seeds, extract some data, save it to a SQL database, 
and move on to the next page'. One thing, though: I do add a rather large list 
of seeds (one per profile page), easily a few hundred.
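
For concreteness, here is a minimal sketch of the kind of setup described above, assuming the crawler4j 3.x API; the storage folder, seed URLs, politeness delay, and the `ProfileCrawler` class are illustrative placeholders, not the reporter's actual code:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class ProfileCrawlDemo {

    // Minimal crawler: visit only the seeded profile pages.
    public static class ProfileCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(WebURL url) { // crawler4j 3.x signature
            return url.getURL().startsWith("http://example.com/profiles/");
        }

        @Override
        public void visit(Page page) {
            // The report says data is extracted and saved to a SQL database here.
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-data"); // placeholder path
        config.setPolitenessDelay(1000);                     // "be polite"

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        // The real run seeds a few hundred profile URLs; two placeholders here.
        controller.addSeed("http://example.com/profiles/1");
        controller.addSeed("http://example.com/profiles/2");

        controller.start(ProfileCrawler.class, 1); // just one crawler thread
    }
}
```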

What is the expected output? What do you see instead?

Expected output: a nicely filled database.

What I get instead: the crawler simply freezes every time I run it, and always 
at the same point (the 15th item)! By 'freezes' I mean: no more console output 
(nothing from my own System.out calls, nothing from crawler4j, no errors, 
nothing). It just hangs, and no crawled data comes in, so crawling appears to 
have stopped.
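
One generic way to see where a hang like this occurs is a JVM thread dump taken while the crawler is stuck (for example with the JDK's jstack tool). A minimal in-process sketch, not part of the original report, that prints the stack of every live thread:

```java
import java.util.Map;

public class ThreadDumper {
    // Print every live thread and its stack. Run this (e.g. from a separate
    // watchdog thread) while the crawler appears frozen; threads blocked in
    // socket reads or lock waits will show up in the output.
    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            System.out.println(entry.getKey());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```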

What version of the product are you using?

crawler4j-3.3.

Please provide any additional information below.

I intentionally sketched the problem conceptually. If any of this sounds 
familiar, or if you are willing to help and need more specific information, 
please let me know and I'll get right back to you!

I'm on a tight schedule here and really, any help is very welcome and much 
appreciated!

Original issue reported on code.google.com by misterbu...@gmail.com on 11 Oct 2012 at 8:29

GoogleCodeExporter commented 9 years ago
I hit the same problem with roughly 100k+ seeds in the database. After a while, 
and again after a restart, the crawler seems to freeze. There is a lot of 
hard-drive traffic; it looks like the database is being reorganized or is locked.

Did someone solve that problem?

Original comment by an.h...@gmail.com on 16 Dec 2012 at 11:26

GoogleCodeExporter commented 9 years ago
I am also facing the same problem: the crawler freezes after a certain 
interval. Please share any suggestions.

Original comment by c.krithi...@gmail.com on 24 Mar 2013 at 7:09

GoogleCodeExporter commented 9 years ago
Please set config.setResumableCrawling(true) and restart the crawler; it will 
resume fetching from where it last stopped.

Original comment by c.krithi...@gmail.com on 31 May 2014 at 6:00
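
A minimal sketch of the suggested configuration, assuming the crawler4j CrawlConfig API (the storage path is a placeholder):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ResumableConfigExample {
    public static CrawlConfig resumableConfig() {
        CrawlConfig config = new CrawlConfig();
        // Must point at the same folder across restarts, so the on-disk
        // crawl state from the previous (frozen) run can be reloaded.
        config.setCrawlStorageFolder("/tmp/crawler4j-data"); // placeholder path
        // Persist crawl state; after a restart, crawling resumes from the
        // last stored position instead of starting over.
        config.setResumableCrawling(true);
        return config;
    }
}
```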

GoogleCodeExporter commented 9 years ago
Setting config.setResumableCrawling is not an option for me. Has anyone figured 
out what is causing this?

Original comment by jeger...@gmail.com on 17 Jul 2014 at 2:55

GoogleCodeExporter commented 9 years ago
I need a more specific scenario.

I need the list of URLs so I can test this myself.

Original comment by avrah...@gmail.com on 11 Aug 2014 at 2:14

GoogleCodeExporter commented 9 years ago
Closed due to inactivity and the lack of a reproducible scenario.

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:05