Better handling of failed connections inside a pool

rra commented 1 year ago

We're using bonsai with AIOConnectionPool and AIOPoolContextManager and are seeing some tricky state-tracking issues with failed connections. I'm not sure I entirely understand the problem, let alone the best solution, but wanted to open an issue to start a discussion.

The root problem is that we have a long-running service that wants to maintain an open LDAP connection pool to save on reconnect overhead under load, but sometimes the service is idle for long enough that firewalls between the service and the LDAP server time out the connections. When this happens, attempting to query the connection just never returns a result unless a timeout is set (presumably it would eventually fail at the TCP timeout, but that's often quite long).

We work around this by setting a timeout on the search request, and that works fine (it feels from local testing like something down inside bonsai may be busy-waiting on the search result in an otherwise idle asyncio pool, but I haven't been able to track that down and it's not really a problem, just an aside). However, when the connection does time out, the bonsai pool doesn't naturally recover.

In other words, I was hoping I could write some code like this:

async with AIOPoolContextManager(self._pool) as conn:
    try:
        return await conn.search(
            base=base,
            scope=scope,
            filter_exp=filter_exp,
            attrlist=attrlist,
            timeout=LDAP_TIMEOUT,
        )
    except bonsai.LDAPError as e:
        logger.error("Cannot query LDAP", error=str(e))
        raise LDAPError("Error querying LDAP") from e

and have the right thing happen, but what does happen is that the failed connection is returned to the pool and then handed back to a later caller and of course will still fail. What I expected instead was for AIOPoolContextManager in its __aexit__ method to notice that it was exiting with an exception indicating the connection is bad (which may be a bonsai error or asyncio.TimeoutError) and mark the connection as invalid so that it will be closed and reopened, rather than returning the failed connection to the pool.

Currently, it appears that what I have to do is close and reopen the connection myself before allowing it to be returned to the pool to maintain the invariant that all connections in the pool are valid. In other words, adding another except block like:

    except (bonsai.ConnectionError, asyncio.TimeoutError):
        logger.debug("Reopening LDAP connection after timeout")
        conn.close()
        await conn.open(timeout=LDAP_TIMEOUT)

This does work, but of course has the problem that conn.open may fail, at which point a closed connection is returned to the pool. So what I currently have is the following, which I think will work but which is kind of a mess:

try:
    for _ in range(2):
        async with AIOPoolContextManager(self._pool) as conn:
            try:
                logger.debug("Querying LDAP")
                return await conn.search(
                    base=base,
                    scope=scope,
                    filter_exp=filter_exp,
                    attrlist=attrlist,
                    timeout=LDAP_TIMEOUT,
                )
            except (bonsai.ConnectionError, asyncio.TimeoutError):
                logger.debug("Reopening LDAP connection after timeout")
                conn.close()
                await conn.open(timeout=LDAP_TIMEOUT)
            except bonsai.ConnectionClosed:
                # If we timed out and then the open failed, a closed
                # connection is left in the pool and returned later,
                # at which point it fails with this exception.
                # Attempt to reopen it.
                logger.debug("Attempting to reopen closed connection")
                await conn.open(timeout=LDAP_TIMEOUT)
except bonsai.LDAPError as e:
    logger.error("Cannot query LDAP", error=str(e))
    raise LDAPError("Error querying LDAP") from e

# Failed twice with a timeout.
msg = f"LDAP query timed out after {LDAP_TIMEOUT}s"
logger.error("Cannot query LDAP", error=msg)
raise LDAPError(msg)

This catches the error from a closed connection and tries to reopen it. But it feels like this is working around state tracking issues that ideally should be handled in the pool, and instead the pool should have a list of failed connections that are reopened before they're returned again.

Does this all make sense? I'm happy to try to put together a PR if the solution makes sense, but I'm not sure I'm thinking about the problem in the right way.

noirello commented 1 year ago

Yes, it makes sense.

I'm not sure that there's an easy way to detect corrupt connections. ConnectionError might seem like an obvious sign, but I'm not that confident about timeout errors.

One change I'm considering right now is the idea of not putting back closed connections to the pool. I can't think of any reason to keeping them in cycle. It'd be a relatively simple change in the put method, because the connection object already has an attribute about its closed state (While getting the last exception would require much more work). Thus closing a corrupt connection would give you the possibility to implicitly remove it from the pool. So by writing something like this, you could be sure not getting back the invalid connection:

async with AIOPoolContextManager(self._pool) as conn:
    try:
        logger.debug("Querying LDAP")
        return await conn.search(
            base=base,
            scope=scope,
            filter_exp=filter_exp,
            attrlist=attrlist,
            timeout=LDAP_TIMEOUT,
        )
    except (bonsai.ConnectionError, asyncio.TimeoutError):
        conn.close()

rra commented 1 year ago

Oh, this is a great idea. I think this is a much better solution than what I was thinking of, since it avoids encoding the logic for closing and reopening connections into bonsai, which is not obvious and may not be correct for every use case. I can keep the logic about when to close the connection and then the pool manager will do the right thing. I can put together a PR if you'd like; looks like get already has the logic to grow the pool again if needed, so yes, this should be very simple.

noirello commented 1 year ago

A PR would be great, thank you.

noirello / bonsai

Better handling of failed connections inside a pool #64