Open eladbitton opened 7 years ago
Related stackoverflow question here: https://stackoverflow.com/questions/39012081/python-urllib3-closedpoolerror-on-crawling
Do you have a complete traceback, please? I'd like to see where this failure is coming from.
This is what I have from my log:
<class 'urllib3.exceptions.ClosedPoolError'>, ClosedPoolError("HTTPConnectionPool(host='homepage.ntlworld.com', port=80): Pool is closed.",), <traceback object at 0x7f01db7462c8>
Hrm, we're really going to need a better traceback. If you're not catching this, can you add a block that catches it and fires log.exception
to provide the traceback in the logs?
Full traceback :)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 235, in _get_conn
conn = self.pool.get(block=self.block, timeout=timeout)
AttributeError: 'NoneType' object has no attribute 'get'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/elad/workspace/Crawler/crawler/Downloader.py", line 62, in download
headers=headers)
File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 69, in request
**urlopen_kw)
File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 90, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.5/dist-packages/urllib3/poolmanager.py", line 248, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
release_conn=release_conn, **response_kw)
File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 583, in urlopen
conn = self._get_conn(timeout=pool_timeout)
File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 238, in _get_conn
raise ClosedPoolError(self, "Pool is closed.")
urllib3.exceptions.ClosedPoolError: HTTPConnectionPool(host='reader.co.il', port=80): Pool is closed.
So, that's convenient. It seems like we've followed a redirect and are now attempting to grab a connection out of the connection pool. Somehow, in between this, the pool is getting closed.
It looks like the PoolManager allows this behaviour: essentially, once requests have been made to more than num_pools distinct hosts, the least-recently-used connection pool is forcefully closed. This won't affect outstanding requests, but it will affect redirects.
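That eviction behaviour can be illustrated with a small stdlib-only sketch. This is a simplified model, not urllib3's actual code (urllib3 keeps pools in a RecentlyUsedContainer whose dispose function closes evicted pools); the class and host names here are made up for illustration:

```python
from collections import OrderedDict

class FakePool:
    """Stand-in for an HTTPConnectionPool; only tracks closed state."""
    def __init__(self, host):
        self.host = host
        self.closed = False

    def close(self):
        self.closed = True

class LRUPools:
    """Sketch of PoolManager's per-host pool cache with LRU eviction."""
    def __init__(self, num_pools):
        self.num_pools = num_pools
        self.pools = OrderedDict()

    def pool_for(self, host):
        if host in self.pools:
            self.pools.move_to_end(host)  # mark as most recently used
            return self.pools[host]
        pool = self.pools[host] = FakePool(host)
        if len(self.pools) > self.num_pools:
            # Evict and close the least-recently-used pool, like
            # urllib3's dispose_func does.
            _, oldest = self.pools.popitem(last=False)
            oldest.close()
        return pool

manager = LRUPools(num_pools=1)
a = manager.pool_for("a.example")
b = manager.pool_for("b.example")  # evicts and closes a.example's pool
print(a.closed)  # True: a redirect back to a.example would now hit a closed pool
```

With a real PoolManager, a request that redirects back to an evicted host would then raise ClosedPoolError, which matches the tracebacks above.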
What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it? @haikuginger, @shazow, got theories on that?
If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?
Because ConnectionPools were originally, and still are, a first-class object that can be used on their own for a single host. This was a cause of confusion for me as well, and is one of a number of reasons that redirects were refactored to be part of the RequestMethods base class as part of the SessionManager branch.
What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?
Historically, ConnectionPools came first, and the first attempt at redirection code lived there. Of course it did not work for cross-host requests.
Then PoolManagers came, and the redirection code was ~copied, then later improved by the Retry objects stuff.
I'd be +1 on removing redirect support from ConnectionPools, as it's very edge-case-y anyway (since it breaks for cross-host redirects). I feel that if you're going to use such a low-level primitive, you should be ready to do your own redirecting.
Bonus points if our redirecting code can be used in stand-alone for those who want to. This would be a great recipe to have in docs, if it's not there already.
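As a sketch of what such a stand-alone recipe might look like: the redirect-following loop can be written against any low-level fetch callable. The fetch signature here is hypothetical, chosen for illustration (it is not a urllib3 API; with urllib3 you would pass redirect=False and read the Location header yourself):

```python
# Hypothetical recipe: follow redirects manually on top of a low-level
# fetch callable that returns (status, headers, body).
REDIRECT_STATUSES = (301, 302, 303, 307, 308)

def follow_redirects(fetch, url, max_redirects=5):
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        location = headers.get("location")
        if status in REDIRECT_STATUSES and location:
            url = location  # follow the redirect ourselves
            continue
        return url, status, body
    raise RuntimeError("exceeded %d redirects" % max_redirects)

# Fake fetch for demonstration: /old redirects once to /new.
def fake_fetch(url):
    if url == "http://example.com/old":
        return 301, {"location": "http://example.com/new"}, b""
    return 200, {}, b"hello"

final_url, status, body = follow_redirects(fake_fetch, "http://example.com/old")
print(final_url, status)  # http://example.com/new 200
```

A real version would also want to honour the 303 "switch to GET" rule and strip sensitive headers on cross-host redirects, which is exactly the logic the Retry machinery centralises.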
I'd be +1 on pulling it out of the ConnectionPool too.
I have no objection to that.
I'm working on this since it's also necessary for #952.
When will this fix get into the main repository?
When the fix is finished. =)
Any projection for that? :)
Nope. The nature of these things is that without any form of contracted work the work will get done when someone is motivated sufficiently to complete that contribution.
@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.
Hitting the same issue while using a bunch of threads to mass-download HTML content. Is there anything I could do to help you fix this, or do you reckon I should handle redirects on my own, as @shazow says:
I feel if you're going to use such a low-level primitive, you should be ready to do your own redirecting.
Hitting the same issue here as well, but in my case I'm using requests, which wraps urllib3.
@Lukasa, @haikuginger and @shazow I'm really aiming to get this solved. I'm studying how I can submit a PR with tested code and so on to get this fixed and help the community.
If you guys have any tips, just leave them on the table 😄 I'll see what I can do 👍
One possible way to circumvent this is to write a small retry wrapper like the following (note that Callable comes from typing, and that logger and CouldNotAccomplishExecutionException must be defined elsewhere in your code):

from typing import Callable

def _retry_if_exception_is_caught(execution_to_be_callable: Callable):
    max_tries = 5
    tries = 0
    while True:
        try:
            return execution_to_be_callable()
        except Exception as e:
            logger.warning(f"Exception of type {type(e)} was caught. Trying again...")
            if tries >= max_tries:
                raise CouldNotAccomplishExecutionException()
            tries += 1
Then you can use it like this:

def do_the_call():
    with requests.Session() as session:
        return session.get("http://your-service-url")

foo = _retry_if_exception_is_caught(do_the_call)
For those who are facing the same problem.
So, digging into this, I found that it fails with this error even when block is False:
Traceback (most recent call last):
File "/home/mbazzan/project/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 660, in urlopen
conn = self._get_conn(timeout=pool_timeout)
File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 259, in _get_conn
raise ClosedPoolError(self, f"Pool is closed. {self.block} {timeout}")
urllib3.exceptions.ClosedPoolError: HTTPSConnectionPool(host='zzz.zzz-api.us-east-1.amazonaws.com', port=443): Pool is closed.
while the docs say:
If no connections are available and :prop:`.block` is ``False``, then a
fresh connection is returned.
I ended up patching _get_conn like this (queue, EmptyPoolError, is_connection_dropped, log and connectionpool all come from urllib3's own modules):

def _get_conn(self, timeout=None):
    conn = None
    try:
        conn = self.pool.get(block=self.block, timeout=timeout)
    except AttributeError:  # self.pool is None
        return self._new_conn()
    except queue.Empty:
        if self.block:
            raise EmptyPoolError(
                self,
                "Pool reached maximum size and no more connections are allowed.",
            )
        pass  # No connection available; fall through and create a new one.

    if conn and is_connection_dropped(conn):
        log.debug("Resetting dropped connection: %s", self.host)
        conn.close()
        if getattr(conn, "auto_open", 1) == 0:
            conn = None

    return conn or self._new_conn()

connectionpool.HTTPConnectionPool._get_conn = _get_conn
It works, but I don't know the drawbacks. Any help here?
@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.
How large do bounties usually need to get before somebody bites?
I am building a crawler with Python 3 and urllib3. I am using a PoolManager instance that is shared by 15 different threads. While crawling thousands of websites I get a lot of ClosedPoolError exceptions from different websites.
On the documentation - ClosedPoolError:
It appears that the PoolManager instance is trying to use a closed connection.
My code:
How can I make the PoolManager renew the connection and try again?