mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Fatal transport error while fetching robots.txt #199

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I downloaded the code from the project home (the crawler, the controller, and the rest).
2. I set up the classpath and compiled the classes.
3. I used the URL www.patentstorm.us.
4. The number of threads is set to 1.
5. I am using the default user agent, i.e. "crawler4j (http://code.google.com/p/crawler4j/)".
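For reference, the setup described in the steps above can be sketched against the crawler4j 3.x API. This is a sketch, not runnable without the crawler4j jar; the storage folder path is a placeholder, and BasicCrawler stands in for the example crawler class shipped with the project:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class SingleThreadCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-root");  // placeholder path
        // The default user agent mentioned in step 5:
        config.setUserAgentString("crawler4j (http://code.google.com/p/crawler4j/)");

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("http://www.patentstorm.us/");
        // Step 4: a single crawler thread. BasicCrawler is the bundled example class.
        controller.start(BasicCrawler.class, 1);
    }
}
```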

What is the expected output? What do you see instead?
Ans: The code should crawl the URL and fetch the data from it.

ERROR [main] Fatal transport error: www.patentstorm.us while fetching 
http://www.patentstorm.us/robots.txt (link found in doc #0)

What version of the product are you using?
Ans: Version 3.3

Please provide any additional information below.
Websites that don't have a robots.txt file get crawled without any problem.
So what user agent should I use to fix the error?

Original issue reported on code.google.com by ursrak...@gmail.com on 1 Mar 2013 at 6:36

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi, please help with this.
I am getting this error in the crawler: "Fatal transport error: null while fetching
https://www.stg.nylinvestments.com/mainstay (link found in doc #0)"
crawler4j version 3.6. How can I resolve this issue?

Original comment by aysthi...@gmail.com on 10 Feb 2014 at 1:32

GoogleCodeExporter commented 9 years ago
Hi, please help with this error.
ERROR [main] Fatal transport error: Connection to http://www.ics.uci.edu 
refused while fetching http://www.ics.uci.edu/robots.txt (link found in doc #0)
 INFO [main] Crawler 1 started.
 INFO [main] Crawler 2 started.
ERROR [Crawler 2] Fatal transport error: Connection to http://www.ics.uci.edu 
refused while fetching http://www.ics.uci.edu/ (link found in doc #0)
ERROR [Crawler 2] Fatal transport error: Connection to http://www.ics.uci.edu 
refused while fetching http://www.ics.uci.edu/~lopes/ (link found in doc #0)
ERROR [Crawler 2] Fatal transport error: Connection to http://www.ics.uci.edu 
refused while fetching http://www.ics.uci.edu/~welling/ (link found in doc #0)

I am using crawler4j version 3.5 and am trying to run the basic crawler example.
How can I resolve this issue?

Original comment by kritikar...@gmail.com on 21 May 2014 at 6:54

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi,

I've got the same problem (I'm using a proxy):

SEVERE: Fatal transport error: http://myproxy.com while fetching
http://www.nduoa.com/cat0 (link found in doc #0)
Non success status for link: http://www.nduoa.com/cat1, status code: 1005, 
description: Fatal transport error

I've already set politenessDelay up to 100000, but it does not help.

Please help us.

Original comment by maren.su...@gmail.com on 8 Jul 2014 at 8:05
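For the proxy setup described above, a hedged sketch of the relevant CrawlConfig settings (the proxy setters exist in recent crawler4j releases; the host, port, and path below are placeholders). Note that politenessDelay only spaces out successive requests to the same host, so raising it cannot fix a transport error:

```java
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl-root");  // placeholder path

// Route all fetches (including robots.txt) through the proxy:
config.setProxyHost("myproxy.com");               // placeholder host
config.setProxyPort(8080);                        // placeholder port

// politenessDelay (in ms) only throttles requests to the same host;
// it has no effect on connection-level failures:
config.setPolitenessDelay(200);
```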

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:34

GoogleCodeExporter commented 9 years ago
"Fatal transport error" means that there was an IO exception. Is the site down? Does the site decline the crawler?

These are the links people have reported problems with:

http://www.patentstorm.us/robots.txt -- works for me; in any case it is now a redirect, since the site went down.

https://www.stg.nylinvestments.com/mainstay -- I also get the same error, but the server really is down; I tried with my browser and got the same result.

All of the http://www.ics.uci.edu links work for me.

http://www.nduoa.com/cat0 -- works for me, although I didn't use a proxy.
http://www.nduoa.com/cat1 -- works for me, although I didn't use a proxy.

If any of you want, recheck the problems you had; if they still occur, we can solve them together.

Original comment by avrah...@gmail.com on 20 Aug 2014 at 2:40
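The "IO exception" diagnosis above can be made more concrete. The text after "Fatal transport error:" in the logs (a bare hostname, "null", or "Connection to ... refused") looks like the message of the underlying IOException, and the exception's concrete type narrows down the cause. A small self-contained sketch (plain JDK, class and method names are mine, not crawler4j's) mapping common IOException subclasses to likely causes:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

public class TransportErrorDiagnosis {
    // Map the IOException behind a "Fatal transport error" to a likely cause.
    static String diagnose(IOException e) {
        if (e instanceof UnknownHostException) {
            // Message is just the hostname, e.g. "www.patentstorm.us"
            return "DNS lookup failed (check the hostname, or your proxy/DNS settings)";
        }
        if (e instanceof ConnectException) {
            // Message like "Connection to http://... refused"
            return "connection refused or unreachable (site down, or a firewall blocks you)";
        }
        if (e instanceof SocketTimeoutException) {
            return "timed out (slow site, or the server silently drops the crawler)";
        }
        return "other I/O failure: " + e.getMessage();
    }

    public static void main(String[] args) {
        System.out.println(diagnose(new UnknownHostException("www.patentstorm.us")));
        System.out.println(diagnose(new ConnectException("Connection refused")));
        System.out.println(diagnose(new SocketTimeoutException("connect timed out")));
    }
}
```

Checking the same URL with a browser or curl from the crawling machine, as done above, distinguishes a genuinely down server from one that only rejects the crawler's user agent.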

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:06

GoogleCodeExporter commented 9 years ago
Deleting content of: C:\data\crawl\root\frontier
ERROR [main] Fatal transport error: Connection to http://www.ics.uci.edu 
refused while fetching http://www.ics.uci.edu/robots.txt (link found in doc #0)
 INFO [main] Crawler 1 started.
ERROR [Crawler 1] Fatal transport error: Connection to http://www.ics.uci.edu 
refused while fetching http://www.ics.uci.edu/ (link found in doc #0)
 INFO [Thread-1] It looks like no thread is working, waiting for 10 seconds to make sure...
 INFO [Thread-1] No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
 INFO [Thread-1] All of the crawlers are stopped. Finishing the process...
 INFO [Thread-1] Waiting for 10 seconds before final clean up...

I am still facing this issue.

Original comment by Nitheshk...@gmail.com on 23 Feb 2015 at 11:39