I've been trying to add support for Python 3.11 to multicorn2, and struggling with unexpected crashes. I believe I've identified the core of the problem.
When the Python function log_to_postgres is used with the level ERROR, as it is wisely used in some FDWs to report extended information about a problem... for example: https://github.com/pgsql-io/multicorn2/blob/b0a274c3aeec8341f93ec3f8328cad9f105ae4ee/python/multicorn/fsfdw/__init__.py#L286-L291

This executes errstart/errfinish with a level of ERROR, of course... https://github.com/pgsql-io/multicorn2/blob/b0a274c3aeec8341f93ec3f8328cad9f105ae4ee/src/utils.c#L84-L88

As a result, this begins invoking PostgreSQL's C-based exception handling code, which uses setjmp/longjmp to move execution to the current error handler block (PG_TRY()...PG_CATCH()...PG_END_TRY()). However, the Python interpreter is in the middle of executing a C function call, and its state is left corrupted because control never returns to it.
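For concreteness, here's a minimal sketch of that control flow (generic PostgreSQL error-handling code, not multicorn's actual source). ereport(ERROR, ...) expands to errstart()/errfinish(), and errfinish() at that level never returns; it longjmps to the sigsetjmp buffer installed by the nearest PG_TRY():

```c
#include "postgres.h"   /* ereport, PG_TRY/PG_CATCH/PG_END_TRY */

PG_TRY();
{
    /* If anything below this point reaches ereport(ERROR), control
     * jumps straight to PG_CATCH().  Only PostgreSQL's own state is
     * restored by the longjmp -- nothing tells the Python interpreter
     * that a C call it was executing never returned. */
    ereport(ERROR,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("reported from inside a C call")));
    /* never reached */
}
PG_CATCH();
{
    /* Control lands here via longjmp.  Any interpreter frames between
     * PG_TRY() and the ereport were skipped, not unwound. */
    PG_RE_THROW();
}
PG_END_TRY();
```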
The specific interaction that is breaking right now is:

- calling insert() on an FDW inspects some data and decides to do a log_to_postgres with an error
- that causes the transaction to be aborted, causing multicorn_xact_callback to be invoked, which then attempts to call rollback on the FDW (sketched after this list)
- when running under valgrind, memory access errors occur at this point
- finally a segfault occurs during the errorCheck() that follows the rollback call, when it attempts to import the traceback module -- strongly indicating to me that the interpreter is just fully unusable
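To make that path concrete, here's a rough sketch of the shape of the abort path. The function and method names come from the description above; the body is my assumption of the shape, not multicorn's actual source:

```c
#include "postgres.h"
#include "access/xact.h"    /* XactEvent, XACT_EVENT_ABORT */
#include <Python.h>

extern PyObject *fdw_instance;   /* hypothetical: the cached FDW object */
extern void errorCheck(void);

static void
multicorn_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_ABORT)
    {
        /* This re-enters the Python interpreter even though the earlier
         * longjmp abandoned it in the middle of a C call. */
        PyObject *ret = PyObject_CallMethod(fdw_instance, "rollback", NULL);

        errorCheck();       /* under 3.11, segfaults importing traceback */
        Py_XDECREF(ret);
    }
}
```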
Apparently in Python 3.10 and earlier the interpreter is able to cope with this, which is something of a minor miracle to me. But in 3.11, by the time we re-enter the interpreter after this abrupt exit, it doesn't know how to cope and it begins to access invalid memory, crashing shortly afterwards.
I think the right fix for this would be to:
- Change log_to_postgres so that ERROR and FATAL log levels are translated into Python exceptions, which are thrown back to the Python interpreter (e.g. PostgresLogException)
- In errorCheck(), check if the exception is a PostgresLogException, unwrap the details and log it, triggering Postgres' C exception handling only once we're no longer in the middle of a Python invocation (see the sketch after this list)
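A sketch of what I have in mind, in CPython C API terms. PostgresLogException, the entry-point signature, and the helper names are hypothetical; this is the shape of the idea, not a finished patch:

```c
#include "postgres.h"
#include <Python.h>

/* Hypothetical module-level exception object for the proposal above. */
static PyObject *PostgresLogException;

/* Sketch of the logging entry point (hypothetical signature).  For
 * ERROR and above we raise a Python exception instead of longjmp'ing
 * out from under the interpreter; lower levels can ereport()
 * immediately, since those calls return normally. */
static PyObject *
log_to_postgres_impl(PyObject *self, PyObject *args)
{
    int level;
    const char *message;

    if (!PyArg_ParseTuple(args, "is", &level, &message))
        return NULL;

    if (level >= ERROR)
    {
        PyErr_SetString(PostgresLogException, message);
        return NULL;    /* interpreter unwinds finally blocks, etc. */
    }
    ereport(level, (errmsg("%s", message)));
    Py_RETURN_NONE;
}

/* ...and then in errorCheck(), once control is back in C: */
static void
errorCheck_sketch(void)
{
    if (PyErr_Occurred() && PyErr_ExceptionMatches(PostgresLogException))
    {
        PyObject *type, *value, *tb;

        PyErr_Fetch(&type, &value, &tb);   /* take ownership, clear error */
        /* ... unwrap message/detail/hint from `value` here ... */
        Py_XDECREF(type); Py_XDECREF(value); Py_XDECREF(tb);

        /* No Python invocation is on the stack anymore, so it's now
         * safe to let PostgreSQL's longjmp-based handling take over. */
        ereport(ERROR, (errmsg("error reported from Python FDW")));
    }
}
```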
This approach would have the benefit of unwinding the Python stack -- for example, any finally blocks and context managers that are in use in Python would be exited cleanly. And I think it would leave the interpreter in a good, clean state for re-entry later.
I think that this would still leave a residual problem which is likely "OK"-ish: any other place that could trigger PostgreSQL exceptions is going to fail to perform Python reference counting correctly. For example, any invocation of errorCheck() in the code base today could trigger Postgres's exception handling, which will prevent the following Py_DECREF from being invoked. That could lead to memory leaks in long-running backends using multicorn, but I think it's not of critical importance.
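A minimal illustration of that leak pattern (the method name is arbitrary, not a specific multicorn call site):

```c
/* If errorCheck() raises a PostgreSQL error, it longjmps to the active
 * PG_CATCH() and exits this function entirely, so the Py_DECREF below
 * is simply never executed. */
PyObject *result = PyObject_CallMethod(fdw_instance, "commit", NULL);

errorCheck();        /* may longjmp out of this function */
Py_DECREF(result);   /* skipped on the error path: reference leak */
```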