Broken connections in pool become busy indefinitely

7opf commented 10 months ago

What versions are you using?

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.16.0.0.0

DEBUG:    platform.platform: Linux-5.15.0-1042-azure-x86_64-with-glibc2.35
DEBUG:    sys.maxsize > 2**32: True
DEBUG:    platform.python_version: 3.10.9
DEBUG:    oracledb.__version__: 1.4.0

Is it an error or a hang or a crash? Hang (acquiring a connection is not possible if all connections are busy)
What error(s) or behavior you are seeing?

Some network interruptions cause pool connection(s) to remain in busy list indefinitely. Unfortunately I was unable to simulate the network interruption that reproduces the issue.

Does your application call init_oracle_client()?

No

Include a runnable Python script that shows the problem.

Following call is made every 10s:

await oracleStorage.ping()

class AsyncHelper:
    @staticmethod
    async def execute_io_bound(func: Callable[..., T], *args, **kwargs) -> T:
        """Execute an io-bound (sync) function in a separate thread pool so it does not block the asyncio thread."""

        loop = asyncio.get_running_loop()
        # default asyncio executor is a ThreadPoolExecutor, appropriate for io-bound tasks like DB queries.
        return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))

class OracleStorage():
    def __init__(self):
      self._pool = None

    @property
    def pool(self) -> oracledb.ConnectionPool:
        if self._pool is None:
            self._connect()

        return self._pool

    def _connect(self):
        logging.debug(f"oracledb.__version__: {oracledb.version}")

        # Create a connection pool
        self._pool = oracledb.create_pool(
            connection_id_prefix="firewall-api_",
            dsn="...", # DSN with ewallet and failover options
            user="...",
            password="...",
            min=1,
            max=1,
            getmode=oracledb.POOL_GETMODE_TIMEDWAIT,
            wait_timeout=30000,
            purity=oracledb.PURITY_NEW,
            expire_time=1,
            timeout=900,
        )

        logging.info("Connected to Oracle DB")

    async def ping(self, timeout=900):
        return await AsyncHelper.execute_io_bound(self._ping_sync, timeout=timeout)

    def _ping_sync(self, timeout=900):
        with self.pool.acquire() as connection:
            prev_timeout = connection.call_timeout
            connection.call_timeout = timeout
            connection.ping()
            connection.call_timeout = prev_timeout

7opf commented 10 months ago

Normally what happens is:

Connection is acquired
Call ping()

Error is raised

DPY-4011: the database or network closed the connection

explicitly release (close) the connection, or use context manager - tried both
Connection remains in the list of busy connections
All connections are busy and so acquire() fails for all subsequent attempts

7opf commented 10 months ago

Solely looking at the source code, I think the error is raised here: https://github.com/oracle/python-oracledb/blob/e861b2bbdca5ddbde5ae293e973da552116d7037/src/oracledb/impl/thin/packet.pyx#L204-L206

Just before, the socket is closed, but I don't believe that is enough to close the connection (and return it to the pool / remove it from the busy list)

anthony-tuininga commented 10 months ago

So just to be clear: you are saying that you create the pool and populate it with connections successfully. After that some network glitch occurs and the ping fails and the connection is never released back to the pool? What is the full callstack of the exception? Can you replicate this easily? If so, does it fail without the use of asyncio? Without a call timeout? If you can supply the full callstack I might be able to simluate the error by adjusting the code to force an exception in that location even without the network glitch -- although that might be important. But let's start with the full callstack. Thanks!

cjbj commented 10 months ago

Are you using DRCP?
It's not clear if you are pinging by design, or to overcome the issue you are seeing. In general if you are going to call ping() after every acquire() you may as well remove the ping() and instead set ping_interval to 0, which will have identical behavior. However be aware that the extra round-trip by either solution impacts ultimate scalability, so consider carefully what your business requirements are. In general I'd say don't ping all the time (explicitly or with ping_interval=0), and instead make sure that application statement execution error recovery occurs - possibly via Application Continuity.

anthony-tuininga commented 7 months ago

Can you try the new asyncio support that was just announced in #258? If not, can you provide the callstack as requested earlier? Thanks!

7opf commented 7 months ago

We don't have time to investigate and have been using thick mode. Glad to hear asyncio support is around the corner - I may give it a shot at some point and report any issues I encounter.

7opf commented 2 months ago

Hi @anthony-tuininga and @cjbj

I have started using thin mode + asyncio and the issue persists.

This is the relevant part of the call stack:

   File "/app/products/firewall-api/image.binary.runfiles/panta/products/firewall-api/fwapi/services/oracle.py", line 721, in _fetchall
     async with self.pool.acquire() as con:
   File "/app/products/firewall-api/image.binary.runfiles/python3_global_deps_oracledb/site-packages/oracledb/connection.py", line 1433, in __aenter__
     await self._connect_coroutine
   File "/app/products/firewall-api/image.binary.runfiles/python3_global_deps_oracledb/site-packages/oracledb/connection.py", line 1475, in _connect
     impl = await pool_impl.acquire(params_impl)
   File "src/oracledb/impl/thin/pool.pyx", line 875, in acquire
   File "/app/products/firewall-api/image.binary.runfiles/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
     return fut.result()
   File "src/oracledb/impl/thin/pool.pyx", line 721, in _acquire_helper
   File "src/oracledb/impl/thin/connection.pyx", line 557, in ping
   File "src/oracledb/impl/thin/connection.pyx", line 101, in oracledb.thin_impl.BaseThinConnImpl._create_message
   File "src/oracledb/impl/thin/messages.pyx", line 71, in oracledb.thin_impl.Message._initialize
   File "/app/products/firewall-api/image.binary.runfiles/python3_global_deps_oracledb/site-packages/oracledb/errors.py", line 181, in _raise_err
     raise error.exc_type(error) from cause
 oracledb.exceptions.InterfaceError: DPY-1001: not connected to database

What appears to happen is

The pool attempts to ping on a dead connection. Presumably it attempts to ping, because the connection has been idle for some time and it is checking the liveness of the connection now that it is being acquired. This checks out with what I have observed in an environment in which we have min=2, max=12 and there was a longer period of inactivity (no connections being acquired) before a connection acquisition was attempted.
The dead connection is added to the busy list for the pool - I could see from our metrics that busy count increased without any other activity.
The ping fails and InterfaceError is raised and admittedly not caught by my code (but this should be irrelevant)
The connection remains in the busy list indefinitely
This happens repeatedly over time until max=12 is reached and there are no more connections to give out => unhappy application.

Initially I thought I can just catch the error and close the dead connection which should remove it from the busy list, however conn would not be defined when using the context manager since __aenter__ doesn't return. Instead, something like this could work as a workaround?

        pool = oracledb.create_pool_async(...)
        conn = pool.acquire()

        try:
            conn = await conn
            # do stuff with conn
        finally:
            await pool.release(conn)

Shouldn't the pool be cleaning up this dead connection if it's unable to ping on it? And return another, working connection.

7opf commented 2 months ago

If the ping to the database indicates the connection is not alive a replacement connection will be returned by acquire().

https://python-oracledb.readthedocs.io/en/latest/api_manual/module.html

7opf commented 2 months ago

Further reproduction steps: it has something to do with how connections are recycled.

min=2, max=12

connection pool initialised (busy=0, open=0)
Attempt 1: successful acquisition of connection and query (busy=0, open=2)
Introduce network issue, in my case I blocked the IP of the server
attempt 2: failed acquisition of connection and query (busy=1, open=2)  note: during this step the connection acquisition was blocking and did not timeout despite POOL_GETMODE_TIMEDWAIT set to 10s
Cancel attempt 2 (busy=1, open=2)
Attempt 3: failed acquisition of connection and query (busy=2, open=2)  note: InterfaceError: DPY-1001: not connected to database raised
Attempt 4: failed acquisition of connection and query (busy=2, open=2)  note: asyncio.TimeoutError raised
Remove network issue
Attempt 5: (busy=2, open=3)  note: OperationalError: DPY-6005: cannot connect to database (…).
Attempt 6+: successful acquisition of connection and query (busy=2, open=3)

This was over a period of 10 minutes or so. The busy connections just stay there and once the the network issue reoccurs, it happens again eventually saturating the pool with busy connections that never become free.

anthony-tuininga commented 2 months ago

Looking at the code and your description I think this may be the source of the issue. Are you able to build from source and verify this for me?

--- a/src/oracledb/impl/thin/pool.pyx
+++ b/src/oracledb/impl/thin/pool.pyx
@@ -256,7 +256,6 @@ cdef class BaseThinPoolImpl(BasePoolImpl):
         Called when a new connection is created on acquire with the lock held.
         """
         if orig_conn_impl is not None:
-            self._busy_conn_impls.remove(orig_conn_impl)
             self._busy_conn_impls.append(new_conn_impl)
         else:
             new_conn_impl._is_pool_extra = True
@@ -621,7 +620,6 @@ cdef class ThinPoolImpl(BaseThinPoolImpl):
                     temp_conn_impl = None
                     break
                 temp_conn_impl = <ThinConnImpl> result
-                self._busy_conn_impls.append(temp_conn_impl)
                 if must_reconnect:
                     break

@@ -630,9 +628,11 @@ cdef class ThinPoolImpl(BaseThinPoolImpl):
             if requires_ping:
                 try:
                     temp_conn_impl.ping()
-                except exceptions.DatabaseError:
+                except exceptions.Error:
                     temp_conn_impl._force_close()
             if temp_conn_impl._protocol._transport is not None:
+                with self._condition:
+                    self._busy_conn_impls.append(temp_conn_impl)
                 return temp_conn_impl

         # a new connection needs to be created
@@ -710,7 +710,6 @@ cdef class AsyncThinPoolImpl(BaseThinPoolImpl):
                     temp_conn_impl = None
                     break
                 temp_conn_impl = <AsyncThinConnImpl> result
-                self._busy_conn_impls.append(temp_conn_impl)
                 if must_reconnect:
                     break

@@ -719,9 +718,11 @@ cdef class AsyncThinPoolImpl(BaseThinPoolImpl):
             if requires_ping:
                 try:
                     await temp_conn_impl.ping()
-                except exceptions.DatabaseError:
+                except exceptions.Error:
                     temp_conn_impl._force_close()
             if temp_conn_impl._protocol._transport is not None:
+                async with self._condition:
+                    self._busy_conn_impls.append(temp_conn_impl)
                 return temp_conn_impl

         # a new connection needs to be created

Essentially, if the ping fails AND the creation of the new connection fails, then the connection may remain in the busy list permanently. This patch only adds it to the busy list after it is known to be good. I don't have a good way to test this directly but hopefully you are able to use your existing test case to test this for me!

7opf commented 2 months ago

It appears to fix this issue. I could reproduce the issue on main, but not on main + patch. 👍

However, there is a separate issue which looking back at my recent comment was also present then (step 4).

acquire connection, (optionally) use it [busy=0 opened=0]
release it [busy=0 opened=1]
block server ip
wait for longer than ping_interval
attempt to acquire connection (and ping due to ping_interval) -> hangs indefinitely [busy=0 opened=1]
unblock server ip
asyncio TimeoutError [busy=0 opened=1] (note: busy count is 0 thanks to fix)

I suspect it's the ping that hangs indefinitely, because the connection should already be open and POOL_GETMODE_TIMEDWAIT is set which should also prevent a connection attempt to hang indefinitely.

I'll try to get a stack trace for this

7opf commented 2 months ago

Traceback for the other issue. Codebase = main + patch, Python 3.12.

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "src/oracledb/impl/thin/pool.pyx", line 720, in _acquire_helper
    await temp_conn_impl.ping()
  File "src/oracledb/impl/thin/connection.pyx", line 558, in ping
    await protocol._process_single_message(message)
  File "src/oracledb/impl/thin/protocol.pyx", line 805, in _process_single_message
    async with self._request_lock:
  File "src/oracledb/impl/thin/protocol.pyx", line 806, in oracledb.thin_impl.BaseAsyncProtocol._process_single_message
    await self._process_message(message)
  File "src/oracledb/impl/thin/protocol.pyx", line 753, in _process_message
    raise
  File "src/oracledb/impl/thin/protocol.pyx", line 736, in oracledb.thin_impl.BaseAsyncProtocol._process_message
    await asyncio.wait_for(coroutine, timeout_obj)
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "src/oracledb/impl/thin/protocol.pyx", line 792, in _process_message_helper
    await self._receive_packet(message, check_request_boundary=True)
  File "src/oracledb/impl/thin/protocol.pyx", line 828, in _receive_packet
    await buf.wait_for_packets_async()
  File "src/oracledb/impl/thin/packet.pyx", line 697, in wait_for_packets_async
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ssk/code/python-oracledb/test.py", line 124, in main
    await x.ping_pool()
  File "/Users/ssk/code/python-oracledb/test.py", line 87, in ping_pool
    async with self.pool.acquire() as connection:
  File "/Users/ssk/code/python-oracledb/testev12/lib/python3.12/site-packages/oracledb-2.2.0b1-py3.12-macosx-14.0-arm64.egg/oracledb/connection.py", line 1433, in __aenter__
    await self._connect_coroutine
  File "/Users/ssk/code/python-oracledb/testev12/lib/python3.12/site-packages/oracledb-2.2.0b1-py3.12-macosx-14.0-arm64.egg/oracledb/connection.py", line 1475, in _connect
    impl = await pool_impl.acquire(params_impl)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/oracledb/impl/thin/pool.pyx", line 876, in acquire
    return await asyncio.wait_for(self._acquire_helper(params),
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
    async with timeouts.timeout(timeout):
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

7opf commented 2 months ago

Do you think you'll get the first bug fix into the upcoming release?

anthony-tuininga commented 2 months ago

Yes, this patch will be included in the upcoming release of 2.2. I plan to close this issue after that release has been completed. Thanks for your help in verifying the fix!

Regarding the second issue, the "hang" should not be indefinite as the underlying network will eventually give up and raise an exception. That could, however, be a rather long time! A separate enhancement is needed here -- the addition of pool_ping_timeout with a reasonable default. If you wish to create a separate issue for this additional feature, please go ahead and do so. That will be addressed in the next release.

7opf commented 2 months ago

Thank you!

anthony-tuininga commented 2 months ago

This was included in version 2.2.0 which was just released.

oracle / python-oracledb

Broken connections in pool become busy indefinitely #221