oracledb.connect_async hangs forever

kisv-SOPTIM commented 2 months ago

What versions are you using?

python-oracledb 2.1.1 python 3.9.16

Is it an error or a hang or a crash?

hang

What error(s) or behavior you are seeing?

Calling oracledb.connect_async hangs forever. We only experience this issue in two of three (identical) environments, the third one works normally. Other software components can connect to the database in the affected environment normally.

Please see the following log: https://gist.github.com/kisv-SOPTIM/18d91a699f6050dbadfd6da50bc12407

Does your application call init_oracle_client()?

no

Any hint regarding a possible cause is highly appreciated.

anthony-tuininga commented 2 months ago

One possibility: the database you are connecting to has NNE enabled. Can you check with the queries mentioned in this link? It is possible that the code for asyncio isn't addressing that correctly.

kisv-SOPTIM commented 2 months ago

Thanks for the quick response!

NNE is not enabled.

cjbj commented 2 months ago

Can you set PYO_DEBUG_PACKETS and provide some traces to @anthony-tuininga ? See https://python-oracledb.readthedocs.io/en/latest/user_guide/tracing.html#low-level-python-oracledb-driver-tracing and various other references in Issues and Discussions.

kisv-SOPTIM commented 2 months ago

Can you set PYO_DEBUG_PACKETS and provide some traces to @anthony-tuininga ? See https://python-oracledb.readthedocs.io/en/latest/user_guide/tracing.html#low-level-python-oracledb-driver-tracing and various other references in Issues and Discussions.

The gist linked in my first comment already contains a log excerpt with PYO_DEBUG_PACKETS enabled. Do you require any further debugging information?

Please note that the stack trace at the end of the log is caused by canceling the task externally in an attempt to work around the issue.

anthony-tuininga commented 2 months ago

A few more questions, the answers to which may be helpful:

do you have any difficulty connecting using a regular (not asyncio) connection?
does the failure occur consistently in this environment? Or is it intermittent?
which "other software components" connect normally?
what database versions are being used?
are you familiar with getting server traces? It looks like the server is not responding (the wait is occurring right after the handoff from the listener to the database)
are you familiar with Wireshark? You can see whether or not the server is responding but perhaps not responding the way the driver is expecting
have you tried with thick mode?

kisv-SOPTIM commented 2 months ago

Some general context first: We only experience this problem in our staging and production environment, but neither in our identical test environment, nor in our development environments. The affected software component is part of a project in a highly regulated industry. We only act as the software provider for the project, and the infrastructure, including the database server, is managed by a third party. We only have very limited access to the staging and prod environments, and every change we want to deploy there requires a very long lead time, which makes low level debugging nearly impossible.

do you have any difficulty connecting using a regular (not asyncio) connection?

We originally used the non async connection variant before switching to async, and expirienced no problems with it. We cant easily verify that this would still work today tho.

does the failure occur consistently in this environment? Or is it intermittent?

The failure occurs consistently in the two affected environments, even with different users/schemas

which "other software components" connect normally?

We have about 40 services written in java which can connect normally

what database versions are being used?

select BANNER_FULL from v$version Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.11.0.0.0

The database is an active/standby dataguard cluster. Our python application tries to connect to each of the two nodes separately, to determine their status. We get an (expected) ORA-12514 error from the inactive one, and experience the hang problem on the active one.

are you familiar with getting server traces? It looks like the server is not responding (the wait is occurring right after the handoff from the listener to the database)

No, because of our situation described above, I don't think we will get access to them.

are you familiar with Wireshark? You can see whether or not the server is responding but perhaps not responding the way the driver is expecting

We can't easily perform packet captures in these environments.

have you tried with thick mode?

No

anthony-tuininga commented 2 months ago

Hmm, that sounds like debugging this issue could be a nightmare! Based on the traces you gave earlier it seems like the server has either given up or has sent a partial packet for some reason. If you can't perform a packet capture you could add a print() statement to the data_received() method of the BaseAsyncProtocol class -- just to see if any data is actually being received! If you are able to try thick mode, that might also be informative. And if you can verify that a regular connection still works, then I think there is something haywire with the asyncio implementation -- that would be helpful as well. Hopefully you can do some of this testing so we can track down what is causing this issue for you!

cjbj commented 2 weeks ago

@kisv-SOPTIM I'll 'suspend' this until we can be provided with more information.

oracle / python-oracledb

oracledb.connect_async hangs forever #320