jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Fatal Transport Error: Read timeout while fetching from same host multiple times #253

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Write a proxy service that uses a pool of proxies to send HTTP requests and
returns the responses
2. Make crawler4j use this service for all crawls
3. After 1 or 2 requests, it starts failing with: Fatal transport error:
Read timed out while fetching http://...

What is the expected output? What do you see instead?
The expected output is a crawl that completes without this error:
Fatal transport error: Read timed out while fetching http://...

What version of the product are you using?
3.3

Please provide any additional information below.
The problem goes away when I stop using my proxy service. My proxy service is a
web service with a single endpoint that takes a URL and returns the same HTTP
response you would get by requesting that URL directly. In the background, it
manages a pool of proxies and decides which one to use for each request. Since
the problem disappears when I make normal HTTP requests (i.e. requesting
www.google.com instead of
www.myproxyservice.com/content?url=http://www.google.com), I'm guessing
something in crawler4j rate-limits or blocks repeated HTTP requests to the same
host after a while. Can someone help me with this?

Original issue reported on code.google.com by m...@groupon.com on 3 Feb 2014 at 10:23
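
For what it's worth, crawler4j does throttle requests to the same host by design: CrawlConfig exposes a politeness delay (the wait between two requests to one host) along with socket and connection timeouts, and newer versions also accept proxy settings directly. A minimal sketch of the relevant knobs — the values and the proxy hostname below are illustrative assumptions, not recommendations:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ConfigSketch {
    public static void main(String[] args) {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // required storage folder

        // Per-host throttling: crawler4j waits this many milliseconds between
        // two requests to the same host. Raise it if the target is slow.
        config.setPolitenessDelay(1000);

        // Read and connect timeouts (ms) — a "Read timed out" error means the
        // socket timeout elapsed while waiting for the response body.
        config.setSocketTimeout(30000);
        config.setConnectionTimeout(30000);

        // If the whole crawl should go through an HTTP proxy, newer crawler4j
        // versions can be pointed at it directly instead of rewriting URLs:
        config.setProxyHost("proxy.example.com"); // hypothetical proxy host
        config.setProxyPort(8080);
    }
}
```

Note that a URL-rewriting service like www.myproxyservice.com/content?url=... still looks like a single host to crawler4j, so every fetch is subject to the same per-host politeness delay and shares one connection pool.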

GoogleCodeExporter commented 8 years ago
Never mind, please delete this issue. The problem was caused by sshuttle, a
program I was running in the background that was interfering with the requests.

Original comment by m...@groupon.com on 3 Feb 2014 at 10:54

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 17 Aug 2014 at 5:45

GoogleCodeExporter commented 8 years ago
I'm facing the same error. Is it because the site I'm crawling is behind a 
proxy server? How should I solve this problem?

Original comment by lallian....@utsavfashion.com on 20 Oct 2014 at 11:43

GoogleCodeExporter commented 8 years ago
What is your exact scenario?

Original comment by avrah...@gmail.com on 20 Oct 2014 at 1:54

GoogleCodeExporter commented 8 years ago
No scenario provided; tagging as invalid.

Original comment by avrah...@gmail.com on 22 Jan 2015 at 11:44

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 22 Jan 2015 at 11:44