Closed 7opf closed 2 months ago
Normally what happens is:
ping()
DPY-4011: the database or network closed the connection
acquire()
fails for all subsequent attemptsSolely looking at the source code, I think the error is raised here: https://github.com/oracle/python-oracledb/blob/e861b2bbdca5ddbde5ae293e973da552116d7037/src/oracledb/impl/thin/packet.pyx#L204-L206
Just before, the socket is closed, but I don't believe that is enough to close the connection (and return it to the pool / remove it from the busy list)
So just to be clear: you are saying that you create the pool and populate it with connections successfully. After that some network glitch occurs and the ping fails and the connection is never released back to the pool? What is the full callstack of the exception? Can you replicate this easily? If so, does it fail without the use of asyncio? Without a call timeout? If you can supply the full callstack I might be able to simluate the error by adjusting the code to force an exception in that location even without the network glitch -- although that might be important. But let's start with the full callstack. Thanks!
ping()
after every acquire()
you may as well remove the ping()
and instead set ping_interval
to 0, which will have identical behavior. However be aware that the extra round-trip by either solution impacts ultimate scalability, so consider carefully what your business requirements are. In general I'd say don't ping all the time (explicitly or with ping_interval=0), and instead make sure that application statement execution error recovery occurs - possibly via Application Continuity. Can you try the new asyncio support that was just announced in #258? If not, can you provide the callstack as requested earlier? Thanks!
We don't have time to investigate and have been using thick mode. Glad to hear asyncio support is around the corner - I may give it a shot at some point and report any issues I encounter.
Hi @anthony-tuininga and @cjbj
I have started using thin mode + asyncio and the issue persists.
This is the relevant part of the call stack:
File "/app/products/firewall-api/image.binary.runfiles/panta/products/firewall-api/fwapi/services/oracle.py", line 721, in _fetchall
async with self.pool.acquire() as con:
File "/app/products/firewall-api/image.binary.runfiles/python3_global_deps_oracledb/site-packages/oracledb/connection.py", line 1433, in __aenter__
await self._connect_coroutine
File "/app/products/firewall-api/image.binary.runfiles/python3_global_deps_oracledb/site-packages/oracledb/connection.py", line 1475, in _connect
impl = await pool_impl.acquire(params_impl)
File "src/oracledb/impl/thin/pool.pyx", line 875, in acquire
File "/app/products/firewall-api/image.binary.runfiles/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "src/oracledb/impl/thin/pool.pyx", line 721, in _acquire_helper
File "src/oracledb/impl/thin/connection.pyx", line 557, in ping
File "src/oracledb/impl/thin/connection.pyx", line 101, in oracledb.thin_impl.BaseThinConnImpl._create_message
File "src/oracledb/impl/thin/messages.pyx", line 71, in oracledb.thin_impl.Message._initialize
File "/app/products/firewall-api/image.binary.runfiles/python3_global_deps_oracledb/site-packages/oracledb/errors.py", line 181, in _raise_err
raise error.exc_type(error) from cause
oracledb.exceptions.InterfaceError: DPY-1001: not connected to database
What appears to happen is
min=2, max=12
and there was a longer period of inactivity (no connections being acquired) before a connection acquisition was attempted. InterfaceError
is raised and admittedly not caught by my code (but this should be irrelevant)max=12
is reached and there are no more connections to give out => unhappy application.Initially I thought I can just catch the error and close the dead connection which should remove it from the busy list, however conn
would not be defined when using the context manager since __aenter__
doesn't return. Instead, something like this could work as a workaround?
pool = oracledb.create_pool_async(...)
conn = pool.acquire()
try:
conn = await conn
# do stuff with conn
finally:
await pool.release(conn)
Shouldn't the pool be cleaning up this dead connection if it's unable to ping on it? And return another, working connection.
If the ping to the database indicates the connection is not alive a replacement connection will be returned by acquire().
https://python-oracledb.readthedocs.io/en/latest/api_manual/module.html
Further reproduction steps: it has something to do with how connections are recycled.
min=2, max=12
This was over a period of 10 minutes or so. The busy connections just stay there and once the the network issue reoccurs, it happens again eventually saturating the pool with busy connections that never become free.
Looking at the code and your description I think this may be the source of the issue. Are you able to build from source and verify this for me?
--- a/src/oracledb/impl/thin/pool.pyx
+++ b/src/oracledb/impl/thin/pool.pyx
@@ -256,7 +256,6 @@ cdef class BaseThinPoolImpl(BasePoolImpl):
Called when a new connection is created on acquire with the lock held.
"""
if orig_conn_impl is not None:
- self._busy_conn_impls.remove(orig_conn_impl)
self._busy_conn_impls.append(new_conn_impl)
else:
new_conn_impl._is_pool_extra = True
@@ -621,7 +620,6 @@ cdef class ThinPoolImpl(BaseThinPoolImpl):
temp_conn_impl = None
break
temp_conn_impl = <ThinConnImpl> result
- self._busy_conn_impls.append(temp_conn_impl)
if must_reconnect:
break
@@ -630,9 +628,11 @@ cdef class ThinPoolImpl(BaseThinPoolImpl):
if requires_ping:
try:
temp_conn_impl.ping()
- except exceptions.DatabaseError:
+ except exceptions.Error:
temp_conn_impl._force_close()
if temp_conn_impl._protocol._transport is not None:
+ with self._condition:
+ self._busy_conn_impls.append(temp_conn_impl)
return temp_conn_impl
# a new connection needs to be created
@@ -710,7 +710,6 @@ cdef class AsyncThinPoolImpl(BaseThinPoolImpl):
temp_conn_impl = None
break
temp_conn_impl = <AsyncThinConnImpl> result
- self._busy_conn_impls.append(temp_conn_impl)
if must_reconnect:
break
@@ -719,9 +718,11 @@ cdef class AsyncThinPoolImpl(BaseThinPoolImpl):
if requires_ping:
try:
await temp_conn_impl.ping()
- except exceptions.DatabaseError:
+ except exceptions.Error:
temp_conn_impl._force_close()
if temp_conn_impl._protocol._transport is not None:
+ async with self._condition:
+ self._busy_conn_impls.append(temp_conn_impl)
return temp_conn_impl
# a new connection needs to be created
Essentially, if the ping fails AND the creation of the new connection fails, then the connection may remain in the busy list permanently. This patch only adds it to the busy list after it is known to be good. I don't have a good way to test this directly but hopefully you are able to use your existing test case to test this for me!
It appears to fix this issue. I could reproduce the issue on main, but not on main + patch. 👍
However, there is a separate issue which looking back at my recent comment was also present then (step 4).
I suspect it's the ping that hangs indefinitely, because the connection should already be open and POOL_GETMODE_TIMEDWAIT is set which should also prevent a connection attempt to hang indefinitely.
I'll try to get a stack trace for this
Traceback for the other issue. Codebase = main + patch, Python 3.12.
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "src/oracledb/impl/thin/pool.pyx", line 720, in _acquire_helper
await temp_conn_impl.ping()
File "src/oracledb/impl/thin/connection.pyx", line 558, in ping
await protocol._process_single_message(message)
File "src/oracledb/impl/thin/protocol.pyx", line 805, in _process_single_message
async with self._request_lock:
File "src/oracledb/impl/thin/protocol.pyx", line 806, in oracledb.thin_impl.BaseAsyncProtocol._process_single_message
await self._process_message(message)
File "src/oracledb/impl/thin/protocol.pyx", line 753, in _process_message
raise
File "src/oracledb/impl/thin/protocol.pyx", line 736, in oracledb.thin_impl.BaseAsyncProtocol._process_message
await asyncio.wait_for(coroutine, timeout_obj)
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "src/oracledb/impl/thin/protocol.pyx", line 792, in _process_message_helper
await self._receive_packet(message, check_request_boundary=True)
File "src/oracledb/impl/thin/protocol.pyx", line 828, in _receive_packet
await buf.wait_for_packets_async()
File "src/oracledb/impl/thin/packet.pyx", line 697, in wait_for_packets_async
await self._waiter
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/ssk/code/python-oracledb/test.py", line 124, in main
await x.ping_pool()
File "/Users/ssk/code/python-oracledb/test.py", line 87, in ping_pool
async with self.pool.acquire() as connection:
File "/Users/ssk/code/python-oracledb/testev12/lib/python3.12/site-packages/oracledb-2.2.0b1-py3.12-macosx-14.0-arm64.egg/oracledb/connection.py", line 1433, in __aenter__
await self._connect_coroutine
File "/Users/ssk/code/python-oracledb/testev12/lib/python3.12/site-packages/oracledb-2.2.0b1-py3.12-macosx-14.0-arm64.egg/oracledb/connection.py", line 1475, in _connect
impl = await pool_impl.acquire(params_impl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/oracledb/impl/thin/pool.pyx", line 876, in acquire
return await asyncio.wait_for(self._acquire_helper(params),
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
async with timeouts.timeout(timeout):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
raise TimeoutError from exc_val
TimeoutError
Do you think you'll get the first bug fix into the upcoming release?
Yes, this patch will be included in the upcoming release of 2.2. I plan to close this issue after that release has been completed. Thanks for your help in verifying the fix!
Regarding the second issue, the "hang" should not be indefinite as the underlying network will eventually give up and raise an exception. That could, however, be a rather long time! A separate enhancement is needed here -- the addition of pool_ping_timeout
with a reasonable default. If you wish to create a separate issue for this additional feature, please go ahead and do so. That will be addressed in the next release.
Thank you!
This was included in version 2.2.0 which was just released.
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.16.0.0.0
Is it an error or a hang or a crash? Hang (acquiring a connection is not possible if all connections are busy)
What error(s) or behavior you are seeing?
Some network interruptions cause pool connection(s) to remain in busy list indefinitely. Unfortunately I was unable to simulate the network interruption that reproduces the issue.
No
Following call is made every 10s: