tasfe / crawler4j

Automatically exported from code.google.com/p/crawler4j

Webcrawler freeze when server is not available #287

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Use a URL that is down as the crawler's seed. For example, http://jquery.com/ 
is currently not responding, and the crawler freezes when I use that page. 

What is the expected output? What do you see instead?
At a minimum, the crawler should report "server not found", "page unavailable", 
or something similar. 

What version of the product are you using?
Only tested with version 3.5

Original issue reported on code.google.com by jorgehor...@gmail.com on 19 Aug 2014 at 8:16

GoogleCodeExporter commented 9 years ago
Great scenario - thank you.

My problem is that I doubt that jquery will stay down for long so the scenario 
might disappear...

Original comment by avrah...@gmail.com on 19 Aug 2014 at 8:19

GoogleCodeExporter commented 9 years ago
This was luck: I use this page for my tests, and today my tests froze. 
You can probably find another page that is down (or maybe use an invalid URL?). 

This is my last issue :-) I found no other problems in my last two weeks of 
using crawler4j. 

Original comment by jorgehor...@gmail.com on 19 Aug 2014 at 8:23

GoogleCodeExporter commented 9 years ago
No, please report any issue you find - it is of great help to this open source 
project.

And don't worry, I can always set politeness to 0 and then bombard a site till 
it falls and have my scenario  :-)

Original comment by avrah...@gmail.com on 19 Aug 2014 at 8:25

GoogleCodeExporter commented 9 years ago
Well, here are my logs from crawling jquery while it is down:

11:30:05.118 [main] ERROR e.u.i.crawler4j.fetcher.PageFetcher - Fatal transport error: Read timed out while fetching http://jquery.com/robots.txt (link found in doc #0)
11:30:05.210 [main] INFO  e.u.i.c.crawler.CrawlController - Crawler 1 started
11:30:25.237 [Crawler 1] ERROR e.u.i.crawler4j.fetcher.PageFetcher - Fatal transport error: Read timed out while fetching http://jquery.com/ (link found in doc #0)
11:30:25.238 [Crawler 1] ERROR e.u.ics.crawler4j.crawler.WebCrawler - null, while processing: http://jquery.com/
11:30:35.238 [Thread-1] INFO  e.u.i.c.crawler.CrawlController - It looks like no thread is working, waiting for 10 seconds to make sure...
11:30:45.242 [Thread-1] INFO  e.u.i.c.crawler.CrawlController - No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
11:30:55.243 [Thread-1] INFO  e.u.i.c.crawler.CrawlController - All of the crawlers are stopped. Finishing the process...
11:30:55.243 [Thread-1] INFO  e.u.i.c.crawler.CrawlController - Waiting for 10 seconds before final clean up...

Original comment by avrah...@gmail.com on 19 Aug 2014 at 8:32

GoogleCodeExporter commented 9 years ago
Your logs make sense. So my code is probably what made your library freeze 
(and jquery has not banned me!). I will check and confirm ASAP. 

Original comment by jorgehor...@gmail.com on 19 Aug 2014 at 8:50

GoogleCodeExporter commented 9 years ago
Found my issue. It was a typo in my test cases: I had set the number of 
crawling threads to '-1', so nothing was done (no message, no errors, nothing 
at all) and the software appeared frozen. It was a stupid error on my side.

Original comment by jorgehor...@gmail.com on 19 Aug 2014 at 9:00
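
For anyone who hits the same silent hang: a minimal sketch of a guard that 
rejects a non-positive thread count before the crawl is started. The class 
CrawlerThreadGuard and the method requirePositiveCrawlerCount are hypothetical 
names invented for this illustration and are not part of crawler4j; MyCrawler 
in the comment is likewise a placeholder for your own WebCrawler subclass.

```java
// Hypothetical helper, NOT part of crawler4j: validates the crawler thread
// count in caller code before it is handed to the crawl controller.
final class CrawlerThreadGuard {

    private CrawlerThreadGuard() {
        // Utility class; no instances.
    }

    /**
     * Returns the count unchanged if it is at least 1; otherwise throws an
     * IllegalArgumentException so the mistake is visible immediately.
     */
    static int requirePositiveCrawlerCount(int numberOfCrawlers) {
        if (numberOfCrawlers < 1) {
            throw new IllegalArgumentException(
                "numberOfCrawlers must be >= 1, but was " + numberOfCrawlers);
        }
        return numberOfCrawlers;
    }

    public static void main(String[] args) {
        int configured = -1; // the typo described in this thread
        try {
            int n = requirePositiveCrawlerCount(configured);
            // With crawler4j on the classpath, the validated value would then
            // be passed on, e.g. controller.start(MyCrawler.class, n);
            System.out.println("Starting " + n + " crawler thread(s)");
        } catch (IllegalArgumentException e) {
            System.err.println("Refusing to start crawl: " + e.getMessage());
        }
    }
}
```

Failing fast in the caller turns the "no message, no errors" situation into an 
explicit exception, which is easier to spot in a test run than a hang.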

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 19 Aug 2014 at 9:20