identified a possible hang in async cursor.executemany()

zzzeek commented 6 months ago

hi gang -

our sqlalchemy test suite for the new async oracledb driver we are developing hangs semi-reliably on one particular series of tests that are fielding an intentional IntegrityError from a cursor.executemany() call.

the condition is extremely difficult to reproduce, and I've come up with a script that seems to reproduce it a bit more portably when run on a linux machine, but it still takes many iterations for the condition to hit. The script below runs the same set of operations over and over again - running it against an Oracle 18c express database here, it will hang usually within 200 iterations. but i have no idea if it will do this in other circumstances. if i move the create table step out of the loop, it doesnt hang at all.

import asyncio
import oracledb
import uuid

async def main():
    conn = await oracledb.connect_async(
        dsn="(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=oracle18c)(PORT=1521))(CONNECT_DATA=(SID=xe)))",
        password="tiger",
        user="scott",
    )
    cursor = conn.cursor()

    try:
        await cursor.execute("DROP TABLE data")
    except:
        pass

    cursor.close()

    i = 0
    while True:
        await run_async_thing(conn)
        i += 1
        print(f"finished run: {i}")

async def run_async_thing(conn):
    cursor = conn.cursor()

    await cursor.execute(
        """
CREATE TABLE data (
    id VARCHAR2(30 CHAR) NOT NULL,
    data VARCHAR2(50 CHAR),
    has_server_default VARCHAR2(30 CHAR) DEFAULT 'some_server_default',
    "uid" CHAR(32),
    PRIMARY KEY (id)
)

"""
    )

    ret0, ret1, ret2 = (
        cursor.var(oracledb.STRING, arraysize=10),
        cursor.var(oracledb.STRING, arraysize=10),
        cursor.var(oracledb.STRING, arraysize=10),
    )

    parameters = [
        {
            "data": f"d{i}",
            '"uid"': uuid.uuid4().hex,
            "ret_0": ret0,
            "ret_1": ret1,
            "ret_2": ret2,
        }
        for i in range(10)
    ]

    try:
        await cursor.executemany(
            'INSERT INTO data (data, "uid") VALUES (:data, :"uid") '
            "RETURNING data.id, data.data, data.has_server_default "
            "INTO :ret_0, :ret_1, :ret_2",
            parameters,
        )
    except oracledb.IntegrityError as err:
        print("got error as expected!")

    print("next go!")

    await cursor.execute("DROP TABLE data")
    cursor.close()

asyncio.run(main())

so if I run it here it looks like:

got error as expected!
next go!
finished run: 1
got error as expected!
next go!
finished run: 2
got error as expected!
next go!
finished run: 3
got error as expected!
next go!
finished run: 4
got error as expected!
next go!

... keeps running ...

finished run: 29
got error as expected!
next go!
finished run: 30
got error as expected!
next go!
finished run: 31

then it's hung. press ctrl-C and stack shows as follows:

^CTraceback (most recent call last):
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/base_events.py", line 684, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/classic/dev/sqlalchemy/test4.py", line 23, in main
    await run_async_thing(conn)
  File "/home/classic/dev/sqlalchemy/test4.py", line 62, in run_async_thing
    await cursor.executemany(
  File "/home/classic/.venv3/lib64/python3.12/site-packages/oracledb/cursor.py", line 1067, in executemany
    await self._impl.executemany(
  File "src/oracledb/impl/thin/cursor.pyx", line 297, in executemany
  File "src/oracledb/impl/thin/protocol.pyx", line 817, in _process_single_message
  File "src/oracledb/impl/thin/protocol.pyx", line 818, in oracledb.thin_impl.BaseAsyncProtocol._process_single_message
  File "src/oracledb/impl/thin/protocol.pyx", line 769, in _process_message
  File "src/oracledb/impl/thin/protocol.pyx", line 872, in _reset
  File "src/oracledb/impl/thin/packet.pyx", line 571, in wait_for_packets_async
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/classic/dev/sqlalchemy/test4.py", line 77, in <module>
    asyncio.run(main())
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt

that's what I got so far. Will see if @caselit can reproduce on his environment.

CaselIT commented 6 months ago

Running from windows with oracle on docker (podman on fedora over wsl2 actually) I can't reproduce (stopped with loop over 1700) with 23 (image gvenzl/oracle-free:23-slim) and 21 (image gvenzl/oracle-xe:21-slim)

With 18 (image gvenzl/oracle-xe:18-slim) it tripped at 534. Stack trace below

Maybe it has something to do with version 18 of oracle?

@zzzeek maybe the best solution is to update the ci to 21 or 23? I'll leave the script running a bit more on 21 to see if it happens

Traceback (most recent call last):
  File "C:\Users\cfede\scoop\apps\python\current\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cfede\scoop\apps\python\current\Lib\asyncio\base_events.py", line 684, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\cfede\Dev\GitHub\sqlalchemy\test4.py", line 25, in main
    await run_async_thing(conn)
  File "C:\Users\cfede\Dev\GitHub\sqlalchemy\test4.py", line 64, in run_async_thing
    await cursor.executemany(
  File "C:\Users\cfede\venvs\sa\Lib\site-packages\oracledb\cursor.py", line 1067, in executemany
    await self._impl.executemany(
  File "src\\oracledb\\impl/thin/cursor.pyx", line 297, in executemany
  File "src\\oracledb\\impl/thin/protocol.pyx", line 817, in _process_single_message
  File "src\\oracledb\\impl/thin/protocol.pyx", line 818, in oracledb.thin_impl.BaseAsyncProtocol._process_single_message
  File "src\\oracledb\\impl/thin/protocol.pyx", line 769, in _process_message
  File "src\\oracledb\\impl/thin/protocol.pyx", line 872, in _reset
  File "src\\oracledb\\impl/thin/packet.pyx", line 571, in wait_for_packets_async
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\cfede\Dev\GitHub\sqlalchemy\test4.py", line 79, in <module>
    asyncio.run(main())
  File "C:\Users\cfede\scoop\apps\python\current\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\cfede\scoop\apps\python\current\Lib\asyncio\runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt

CaselIT commented 6 months ago

Just to add that the same suite has run for years using oracledb and cx-oracle with the sync version and it never hanged

zzzeek commented 6 months ago

well we can update to 21/23 sure, for now I just have the test marked to skip on oracledb async. great news it repros!

CaselIT commented 6 months ago

I'll leave the script running a bit more on 21 to see if it happens

over 25k and it was still running. By this limited test it seems this has something to do with 18 only.

interestingly when I went to stop the 18 container (gvenzl/oracle-xe:18-slim) after the script hanged the db did not close cleanly, and got this warning

WARN[0010] StopSignal SIGTERM failed to stop container oracle in 10 seconds, resorting to SIGKILL

anthony-tuininga commented 6 months ago

I noticed with #277 that Python 3.11 and higher do not experience the hang but Python 3.10 and lower do. I also noticed that the uvloop alternative implementation works fine with Python 3.10 and lower. Which Python version are you using? Are you able to try with the uvloop implementation? It may be something specific to the asyncio implementation.

zzzeek commented 6 months ago

I get the hang here with python 3.12 and we see it on 3.11 as well

anthony-tuininga commented 6 months ago

Ok. Thanks for confirming that. That suggests the issues are unrelated. It was worth checking though!

anthony-tuininga commented 6 months ago

I was able to replicate with Oracle Database 18c and discovered that it was a problem with the break/reset mechanism timing. I have pushed a patch that corrects this for me. If you are able to build from source you can verify that it works for you, too.

CaselIT commented 6 months ago

I've tried with 18c on docker installing from source locally and I could no longer reproduce, stopped when the counter was over 26k. (this is with python on windows to oracle on podman on wsl)

Seems to be fine here.

zzzeek commented 6 months ago

yup seems fine now, I confirmed I can still reproduce w/ the old version and the main version fixes the issue.

anthony-tuininga commented 6 months ago

Thanks for the confirmation!

anthony-tuininga commented 5 months ago

The patch has been included in version 2.0.1 which was just released.

anthony-tuininga commented 1 month ago

@zzzeek, @CaselIT, I have just pushed this change which pretty much reverts the patch that was originally made to resolve this issue. I was able to confirm that the issue remains in the original code prior to the patch but no longer does in the code I just pushed. Just to be safe, if you could run the code as well with this commit and verify that it is not an issue for you, that would be appreciated. I believe other changes have eliminated the original issue, but I'd like to be sure!

zzzeek commented 1 month ago

is there any quick way to build just the thin client without it looking for OCI for the thick client? or can you put up a wheel file somewhere

anthony-tuininga commented 1 month ago

The build of the driver doesn't require OCI or the thick client in any way. You can simply build from source. That said, I believe I was able to confirm that the original issue reported here has been corrected a different way -- so this is just paranoia, I think!

zzzeek commented 1 month ago

OK I can confirm the latest main branch does not have the issue, after I re-confirmed that 2.0.0 still reproduces on this machine, so we're good

anthony-tuininga commented 1 month ago

Thanks, @zzzeek!

oracle / python-oracledb

identified a possible hang in async cursor.executemany() #278