python / cpython

The Python programming language
https://www.python.org
Other
63.36k stars 30.34k forks source link

destroy_gil() is not called over multiple Py_Initialize() / Py_Finalize() calls #104324

Open adr26 opened 1 year ago

adr26 commented 1 year ago

The following simple balanced calls to Py_Initialize() / Py_FinalizeEx() trigger an error when running under Microsoft's Application Verifier on Windows:

  Py_Initialize();
  Py_FinalizeEx();
  Py_Initialize();
  Py_FinalizeEx();

The above sequence causes there to be two calls to create_gil() for the main GIL, without an intervening call to destroy_gil(): the mismatched use of PyMUTEX_INIT()/PyMUTEX_FINI() (create_gil() calls PyMUTEX_INIT() and destroy_gil() calls PyMUTEX_FINI()) translates into mismatched calls to Win32 APIs InitializeCriticalSection() and DeleteCriticalSection(), which are then detected by Application Verifier.

[Note, however, that this error is not specific to Windows and the GIL is being incorrectly re-initialised without a call to PyMUTEX_FINI() on all platforms.]

The first Py_Initialize() causes a call to create_gil() on interp->ceval.gil for the main interpreter (_PyRuntime._main_interpreter), but as a result of bpo-9901, finalize_interp_delete() (as called from Py_FinalizeEx()) currently defers calling _PyEval_FiniGIL() (and thereby destroy_gil()) until the next Py_Initialize():

static void
finalize_interp_delete(PyInterpreterState *interp)
{
    /* Cleanup auto-thread-state */
    _PyGILState_Fini(interp);

    /* We can't call _PyEval_FiniGIL() here because destroying the GIL lock can
       fail when it is being awaited by another running daemon thread (see
       bpo-9901). Instead pycore_create_interpreter() destroys the previously
       created GIL, which ensures that Py_Initialize / Py_FinalizeEx can be
       called multiple times. */

https://github.com/python/cpython/blame/41aff464cef83d2655029ddd180a51110e8d7f8e/Python/pylifecycle.c#L1740

Unfortunately, the second call Py_Initialize() doesn't call destroy_gil() before calling create_gil() a second time on the main interpreter's GIL, causing this error.

@ericsnowcurrently's comment on init_interp_create_gil() ('XXX This is broken with a per-interpreter GIL'):

static PyStatus
init_interp_create_gil(PyThreadState *tstate, int own_gil)
{
    PyStatus status;

    /* finalize_interp_delete() comment explains why _PyEval_FiniGIL() is
       only called here. */
    // XXX This is broken with a per-interpreter GIL.
    _PyEval_FiniGIL(tstate->interp);

notes that the call to _PyEval_FiniGIL() in init_interp_create_gil() doesn't work with per-interpreter GIL, but it's worse than this as it's broken with just re-creating the main interpreter!

The reason for all this appears to be that the GIL is re-initialised by _PyRuntime_Initialize(), before _PyEval_FiniGIL() has a chance to get its hands on it. Prior to GH-104210 from @ericsnowcurrently landing yesterday (and on previous Python versions), _PyEval_FiniGIL() would detect that the GIL is reinitialised by a prior call to _gil_initialize() within _PyRuntime_Initialize(). However, now the main interpreter (_PyRuntime._main_interpreter) has a GIL pointer (interp->ceval.gil) which is cleared when _PyRuntimeState_Init() resets the runtime to _PyRuntimeState_INIT:

    if (runtime->_initialized) {
        // Py_Initialize() must be running again.
        // Reset to _PyRuntimeState_INIT.
        memcpy(runtime, &initial, sizeof(*runtime));
    }

I have reproduced this with recent cpython commit bf89d4283a28dd00836f2c312a9255f543f93fc7 and have attached a log of the callstacks (avrf-gil.txt).

Linked PRs

ericsnowcurrently commented 1 year ago

Thanks for the detailed analysis (and PR), @adr26! I'll take a look.

adr26 commented 1 year ago

Thank you @ericsnowcurrently:

I set the PR up as draft, as I wasn't sure whether this would be the right way to fix the issue — I would be very open to any direction on this.

I have re-run the test_threading mentioned in bpo-9901 a couple of times locally (on Windows) and it passed, so there is some evidence that this provides a fix that doesn't regress the reason that the call to _PyEval_FiniGIL() was deferred in the first place. I would value any comments on whether any further directed test coverage is needed to ensure specifically that we don't regress the original reason this change was put in.

Just to be clear about my wording above: the sequence of Py_Initialize() / Py_FinalizeEx() calls fails on multiple 3.x versions I've tested with. In each case, there's a lack of GIL cleanup before it's re-initialised. The only difference with the per-interpreter GIL changes is the mechanism that _PyEval_FiniGIL() uses to conclude it doesn't need to call destroy_gil()

In both cases, the result was the same (incorrect) behaviour of destroy_gil() not getting called in the deferred _PyEval_FiniGIL(). Given you'd noted this deferred call to _PyEval_FiniGIL() was suspect, I thought you would be the right person to ask, so this isn't impugning gh-104210 — the problem was there both before and after it 😄!

Finally, if the code seems logically correct, then there is a stylistic question as to whether now that _PyEval_FiniGIL() is local to ceval_gil.c, its name should be changed and the external declaration in pycore_ceval.h removed... I'd welcome feedback on that too.

Thanks, Andrew R

ericsnowcurrently commented 3 months ago

@adr26, I started taking a look at this, but realized that there have been quite a few changes in this area recently for the free-threading build. Could you take a look and see if you can still reproduce the problem using the 3.13 branch?

adr26 commented 3 months ago

Hi @ericsnowcurrently, I'm on a vacation just now but will have a look at this once I'm back & let you know if I can reproduce the problem still: thank you for looking into this!