urllib3 / urllib3

urllib3 is a user-friendly HTTP client library for Python
https://urllib3.readthedocs.io
MIT License

ClosedPoolError on crawling #951

Open eladbitton opened 7 years ago

eladbitton commented 7 years ago

I am building a crawler with Python 3 and urllib3. I am using a PoolManager instance that is shared by 15 different threads. While crawling thousands of websites I get a lot of ClosedPoolError exceptions from different websites.

From the documentation on ClosedPoolError:

Raised when a request enters a pool after the pool has been closed.

It appears that the PoolManager instance is trying to use a closed connection.

My code:

from urllib3 import PoolManager, util, Retry
from urllib3.exceptions import MaxRetryError

# Instance of PoolManager is created on init
manager = PoolManager(num_pools=15,
                      maxsize=6,
                      timeout=40.0,
                      retries=Retry(connect=2, read=2, redirect=10))

# Every thread executes a download using the pool manager instance
url_to_download = "**"
headers = util.make_headers(accept_encoding='gzip, deflate',
                            keep_alive=True,
                            user_agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0")
headers['Accept-Language'] = "en-US,en;q=0.5"
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
try:
    response = manager.request('GET',
                               url_to_download,
                               preload_content=False,
                               headers=headers)
except MaxRetryError as ex:
    raise FailedToDownload()

How can I make the PoolManager renew the connection and try again?
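
In the meantime I am working around it with a wrapper roughly like this (just a sketch, not battle-tested; it retries when the pool has been closed out from under the request):

from urllib3.exceptions import ClosedPoolError, MaxRetryError

def download_with_retry(manager, url, headers, attempts=2):
    # Retry when the pool backing this host was closed (e.g. evicted
    # from the PoolManager's cache); a fresh request creates a new pool.
    for attempt in range(attempts):
        try:
            return manager.request('GET', url,
                                   preload_content=False,
                                   headers=headers)
        except ClosedPoolError:
            if attempt == attempts - 1:
                raise
        except MaxRetryError:
            raise FailedToDownload()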

shazow commented 7 years ago

Related stackoverflow question here: https://stackoverflow.com/questions/39012081/python-urllib3-closedpoolerror-on-crawling

Lukasa commented 7 years ago

Do you have a complete traceback, please? I'd like to see where this failure is coming from.

eladbitton commented 7 years ago

This is what I have from my log:

<class 'urllib3.exceptions.ClosedPoolError'>, ClosedPoolError("HTTPConnectionPool(host='homepage.ntlworld.com', port=80): Pool is closed.",), <traceback object at 0x7f01db7462c8>

Lukasa commented 7 years ago

Hrm, we're really going to need a better traceback. If you're not catching this, can you add a block that catches it and fires log.exception to provide the traceback in the logs?

eladbitton commented 7 years ago

Full traceback :)

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 235, in _get_conn
    conn = self.pool.get(block=self.block, timeout=timeout)
AttributeError: 'NoneType' object has no attribute 'get'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elad/workspace/Crawler/crawler/Downloader.py", line 62, in download
    headers=headers)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 69, in request
    **urlopen_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 90, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/poolmanager.py", line 248, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
    release_conn=release_conn, **response_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 583, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 238, in _get_conn
    raise ClosedPoolError(self, "Pool is closed.")
urllib3.exceptions.ClosedPoolError: HTTPConnectionPool(host='reader.co.il', port=80): Pool is closed.

Lukasa commented 7 years ago

So, that's convenient. It seems like we've followed a redirect and are now attempting to grab a connection out of the connection pool. Somehow, in between, the pool is getting closed.

It looks like the PoolManager allows this behaviour: essentially, if the manager starts making requests to a new host after num_pools previous hosts, that can cause the oldest connection pool to be forcefully closed. This won't affect requests that already hold a connection, but it will affect redirects.
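
Roughly, the mechanism would look like this (a sketch with made-up host names; my reading of the pool cache behaviour, not a verified reproduction):

from urllib3 import PoolManager

manager = PoolManager(num_pools=2)  # cache holds at most 2 per-host pools

pool_a = manager.connection_from_host("host-a.example", 80, scheme="http")
pool_b = manager.connection_from_host("host-b.example", 80, scheme="http")

# Touching a third host evicts the least-recently-used pool (host-a)
# and closes it; a redirect still being followed inside pool_a would
# now hit ClosedPoolError when it asks the pool for a connection.
pool_c = manager.connection_from_host("host-c.example", 80, scheme="http")

print(pool_a.pool)  # None once the pool has been closed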

What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it? @haikuginger, @shazow, got theories on that?

haikuginger commented 7 years ago

If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?

Because ConnectionPools were originally, and still are, first-class objects that can be used on their own for a single host. This was a cause of confusion for me as well, and is one of a number of reasons that redirects are being refactored into the RequestMethods base class as part of the SessionManager branch.
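
For context, standalone usage looks something like this (a minimal sketch; the host name is just an example):

from urllib3 import HTTPConnectionPool

# A ConnectionPool talks to exactly one host, with no manager above it.
pool = HTTPConnectionPool("example.com", port=80, maxsize=2)
response = pool.request("GET", "/")
print(response.status)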

shazow commented 7 years ago

What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?

Historically, ConnectionPools came first, and the first attempt at redirection code lived there. Of course it did not work for cross-host requests.

Then PoolManagers came along, and the redirection code was more or less copied over, then later improved by the Retry objects work.

I'd be +1 on removing redirect support from ConnectionPools, as it's very edge-case-y anyway (since it breaks for cross-host). I feel that if you're going to use such a low-level primitive, you should be ready to do your own redirecting.

Bonus points if our redirecting code can be used stand-alone by those who want to. This would be a great recipe to have in the docs, if it's not there already.
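
Something along these lines, say (a rough sketch, untested; urljoin is used because Location headers may be relative):

from urllib.parse import urljoin
from urllib3 import PoolManager

def fetch_following_redirects(manager, url, max_redirects=10):
    # Ask urllib3 not to follow redirects, then walk the Location
    # headers ourselves at the PoolManager level.
    for _ in range(max_redirects):
        response = manager.request("GET", url, redirect=False)
        location = response.get_redirect_location()
        if not location:
            return response
        response.release_conn()  # hand the connection back to its pool
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")

manager = PoolManager()
response = fetch_following_redirects(manager, "http://example.com/")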

Lukasa commented 7 years ago

I'd be +1 on pulling it out of the ConnectionPool too.

haikuginger commented 7 years ago

I have no objection to that.

sigmavirus24 commented 7 years ago

I'm working on this since it's also necessary for #952.

eladbitton commented 7 years ago

When will this fix get into the main repository?

Lukasa commented 7 years ago

When the fix is finished. =)

eladbitton commented 7 years ago

Any projection for that? :)

Lukasa commented 7 years ago

Nope. The nature of these things is that, without any form of contracted work, the work gets done when someone is sufficiently motivated to complete the contribution.

sigmavirus24 commented 7 years ago

@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.

Yomguithereal commented 5 years ago

Hitting the same issue while using a bunch of threads to mass-download HTML content. Is there anything I could do to help you fix this, or do you reckon I should handle redirects on my own, as @shazow says:

I feel if you're going to use such a low-level primitive, you should be ready to do your own redirecting.

willianantunes commented 5 years ago

Hitting the same issue here as well, but in my case I'm using requests, which wraps it.

@Lukasa, @haikuginger and @shazow I'm really aiming to get this solved. I'm studying how I can submit a PR with tested code and so on to get this fixed and help the community.

If you guys have any tips, just leave them on the table 😄 I'll see what I can do 👍

willianantunes commented 5 years ago

One possible way to circumvent this is to write a retry wrapper like this:

import logging
from typing import Callable

logger = logging.getLogger(__name__)


class CouldNotAccomplishExecutionException(Exception):
    pass


def _retry_if_exception_is_caught(execution_to_be_callable: Callable):
    max_tries = 5
    tries = 0
    while True:
        try:
            return execution_to_be_callable()
        except Exception as e:
            logger.warning(f"Exception of type {type(e)} was caught. Trying again...")
        if tries >= max_tries:
            raise CouldNotAccomplishExecutionException()
        tries += 1

Then you can use it like this:

import requests

def do_the_call():
    with requests.Session() as session:
        return session.get("http://your-service-url")

foo = _retry_if_exception_is_caught(do_the_call)
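
One caveat: since the wrapper retries on any Exception, it will also retry non-transient failures. Narrowing the except clause to urllib3.exceptions.ClosedPoolError (which requests, as far as I know, surfaces as requests.exceptions.ConnectionError) would retry only the error discussed in this issue.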

Posting this for those who are facing the same problem.

marcobazzani commented 3 years ago

So, digging into this, I found that even if block is False it fails with this error:

Traceback (most recent call last):
  File "/home/mbazzan/project/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 660, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 259, in _get_conn
    raise ClosedPoolError(self, f"Pool is closed. {self.block} {timeout}")
urllib3.exceptions.ClosedPoolError: HTTPSConnectionPool(host='zzz.zzz-api.us-east-1.amazonaws.com', port=443): Pool is closed.

while the doc says:

    If no connections are available and :prop:`.block` is ``False``, then a
    fresh connection is returned.

marcobazzani commented 3 years ago

I ended up patching _get_conn like this:

import logging
import queue

from urllib3 import connectionpool
from urllib3.exceptions import EmptyPoolError
from urllib3.util.connection import is_connection_dropped

log = logging.getLogger(__name__)


def _get_conn(self, timeout=None):
    conn = None
    try:
        conn = self.pool.get(block=self.block, timeout=timeout)

    except AttributeError:  # self.pool is None; hand out a fresh connection
        return self._new_conn()

    except queue.Empty:
        if self.block:
            raise EmptyPoolError(
                self, "Pool reached maximum size and no more connections are allowed.",
            )

    if conn and is_connection_dropped(conn):
        log.debug("Resetting dropped connection: %s", self.host)
        conn.close()
        if getattr(conn, "auto_open", 1) == 0:
            conn = None

    return conn or self._new_conn()


connectionpool.HTTPConnectionPool._get_conn = _get_conn

It works, but I don't know the drawbacks. Any help here?

xloem commented 1 year ago

@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.

How large do bounties usually need to get before somebody bites?