python / cpython

The Python programming language
https://www.python.org
Other
63.35k stars 30.34k forks source link

Possible deadlock at shutdown while recursively acquiring head lock #102126

Closed Zac-HD closed 1 year ago

Zac-HD commented 1 year ago

https://github.com/HypothesisWorks/hypothesis/issues/3585 is reproducible on Python 3.10.10 but not 3.10.9, and so we suspect that https://github.com/python/cpython/pull/100922 may have introduced a (rare?) deadlock while fixing the data race in https://github.com/python/cpython/issues/100892.

The downstream bug report on Hypothesis includes a reliable (but not minimal reproducer) on OSX - though it's unclear whether this might be an artifact of different patch versions of CPython on the various machines which have checked so far.

Linked PRs

handrews commented 1 year ago

I'm the downstream bug reporter, and I have verified on my M1 Pro Mac OS 12.6.2 laptop that:

I have not had time to find a more minimal reproduction.

sergei-maertens commented 1 year ago

I've created a smaller reproducible example here: https://github.com/sergei-maertens/threading-deadlock

The advantage is that it's possible to reproduce on a linux/docker context and there's no actual application code involved except for two libraries that happen to trigger this when combined (coverage + hypothesis).

@CharString remarked that the hypothesis bug report requires two calls to the st.recursive function and we only have one, however it is possible that we end up having two because there's one call at import time and then another in a different thread due to the way coverage works

I looked into the coverage code a bit yesterday and they do some funky stuff to eventually pass it to exec (see https://github.com/nedbat/coveragepy/blob/master/coverage/execfile.py#L211) which would probably account for the additional thread? I'm not very familiar with these lower level concepts though.

carljm commented 1 year ago

I was able to repro this issue on my Mac using Docker and the instructions at https://github.com/sergei-maertens/threading-deadlock

I also attempted to repro using the same repro.py and requirements.txt, but on an Amazon EC2 bare-metal Ubuntu Linux, building CPython myself (aiming to confirm bisection to the specific commit) but was unable to repro in that environment on any CPython commit (I tried 3.10 latest and main branch). The repro steps I tried:

> cd ~/cpython
> ./configure
> make -j
> cd ~/threading-deadlock
> ~/cpython/python -m venv venv
> ./venv/bin/pip install -r requirements.txt
> COVERAGE_DEBUG=process,pid ./venv/bin/coverage run repro.py
47548.f44a: cwd is now '/home/carljm/threading-deadlock'
47548.f44a: New process: executable: '/home/carljm/threading-deadlock/venv/bin/python'
47548.f44a: New process: cmd: ['./venv/bin/coverage', 'run', 'repro.py']
47548.f44a: New process: pid: 47548, parent pid: 22454
irrelevant
47548.f44a: atexit: pid: 47548, instance: <coverage.control.Coverage object at 0x7fc178b4a540>
karolyi commented 1 year ago

Hey,

I've run into the same problem and went to #python on liberachat where I got some help. I have a django project that uses django-haystack, and the same phenomenon occurs. Because of this bug, my entire CI and production system started acting weird because of the hanging processes.

Upon debugging the hanging process with gdb with the python extension enabled, I've got the following debug log:

source /usr/local/src/Python-3.10.10/python-gdb.py

(gdb) thread apply all backtrace

Thread 1 (LWP 101784 of process 2863):
#0  0x0000000800b085ba in _umtx_op () from /lib/libc.so.7
#1  0x0000000800a8ab26 in sem_clockwait_np () from /lib/libc.so.7
#2  0x00000000004ff984 in PyThread_acquire_lock_timed (lock=0x80121a000, microseconds=<optimized out>, intr_flag=intr_flag@entry=0) at Python/thread_pthread.h:486
#3  0x00000000004ffbac in PyThread_acquire_lock (lock=<optimized out>, waitflag=waitflag@entry=1) at Python/thread_pthread.h:752
#4  0x000000000054f175 in local_clear (self=self@entry=0x810b31f60) at ./Modules/_threadmodule.c:861
#5  0x000000000054f266 in local_dealloc (self=0x810b31f60) at ./Modules/_threadmodule.c:893
#6  0x00000000004701f3 in subtype_dealloc (self=<ClientIndex at remote 0x810b31f60>) at Objects/typeobject.c:1460
#7  0x000000000044caad in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#8  _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#9  free_keys_object (keys=0x810b21330) at Objects/dictobject.c:628
#10 0x000000000044d430 in dictkeys_decref (dk=0x810b21330) at Objects/dictobject.c:337
#11 dict_dealloc (mp=0x810b68fc0) at Objects/dictobject.c:2076
#12 0x000000000044d375 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#13 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#14 dict_dealloc (mp=0x810b5ce40) at Objects/dictobject.c:2068
#15 0x0000000000470321 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#16 subtype_dealloc (self=<UnifiedIndex(_indexes={<ModelBase(__module__='myproject.booking.models', __doc__='Recipient of an `Order`.', autocomplete_fields={'first_name': 'given-name', 'last_name': 'family-name', 'salutation': 'honorific-title', 'title': 'honorific-suffix', 'company': 'organization', 'street_and_nr': 'address-line1', 'address_ext': 'address-line2', 'postal_code': 'postal-code', 'place': 'address-level2', 'state': 'address-level1', 'country': 'country', 'phone': 'tel', 'fax': 'fax', 'email': 'email', 'birth_date': 'bday'}, placeholder_fields={'first_name': <__proxy__(_proxy____args=('John',), _proxy____kw={}) at remote 0x80cd54220>, 'last_name': <__proxy__(_proxy____args=('Doe',), _proxy____kw={}) at remote 0x80cd542b0>, 'company': <__proxy__(_proxy____args=('Example company',), _proxy____kw={}) at remote 0x80cd54340>, 'street_and_nr': <__proxy__(_proxy____args=('Example street 1',), _proxy____kw={}) at remote 0x80cd543d0>, 'address_ext': <__proxy__(_proxy____args=('Apartment, suite, unit, building, floor',), _proxy__...(truncated)) at Objects/typeobject.c:1438
#17 0x000000000044d375 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#18 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#19 dict_dealloc (mp=0x810b690c0) at Objects/dictobject.c:2068
#20 0x0000000000470321 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#21 subtype_dealloc (self=<MyProjectSearchEngine(using='default', options={'ENGINE': 'myproject.base.haystack.MyProjectSearchEngine', 'PATH': '/home/project/build/backend/xapian_index/default', 'myproject-lockfile': '/home/project/build/backend/xapian_index/default.lockfile'}, queries=[], _index=<UnifiedIndex(_indexes={<ModelBase(__module__='myproject.booking.models', __doc__='Recipient of an `Order`.', autocomplete_fields={'first_name': 'given-name', 'last_name': 'family-name', 'salutation': 'honorific-title', 'title': 'honorific-suffix', 'company': 'organization', 'street_and_nr': 'address-line1', 'address_ext': 'address-line2', 'postal_code': 'postal-code', 'place': 'address-level2', 'state': 'address-level1', 'country': 'country', 'phone': 'tel', 'fax': 'fax', 'email': 'email', 'birth_date': 'bday'}, placeholder_fields={'first_name': <__proxy__(_proxy____args=('John',), _proxy____kw={}) at remote 0x80cd54220>, 'last_name': <__proxy__(_proxy____args=('Doe',), _p...(truncated)) at Objects/typeobject.c:1438
#22 0x000000000044caad in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#23 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#24 free_keys_object (keys=0x810b40240) at Objects/dictobject.c:628
#25 0x000000000044d430 in dictkeys_decref (dk=0x810b40240) at Objects/dictobject.c:337
#26 dict_dealloc (mp=0x810b69480) at Objects/dictobject.c:2076
#27 0x000000000044caad in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#28 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#29 free_keys_object (keys=0x810a7f3c0) at Objects/dictobject.c:628
#30 0x000000000044d430 in dictkeys_decref (dk=0x810a7f3c0) at Objects/dictobject.c:337
#31 dict_dealloc (mp=0x808cea580) at Objects/dictobject.c:2076
#32 0x000000000044d5d0 in _Py_DECREF (op={'connections': {'default': <MyProjectSearchEngine(using='default', options={'ENGINE': 'myproject.base.haystack.MyProjectSearchEngine', 'PATH': '/home/project/build/backend/xapian_index/default', 'myproject-lockfile': '/home/project/build/backend/xapian_index/default.lockfile'}, queries=[], _index=<UnifiedIndex(_indexes={<ModelBase(__module__='myproject.booking.models', __doc__='Recipient of an `Order`.', autocomplete_fields={'first_name': 'given-name', 'last_name': 'family-name', 'salutation': 'honorific-title', 'title': 'honorific-suffix', 'company': 'organization', 'street_and_nr': 'address-line1', 'address_ext': 'address-line2', 'postal_code': 'postal-code', 'place': 'address-level2', 'state': 'address-level1', 'country': 'country', 'phone': 'tel', 'fax': 'fax', 'email': 'email', 'birth_date': 'bday'}, placeholder_fields={'first_name': <__proxy__(_proxy____args=('John',), _proxy____kw={}) at remote 0x80cd54220>, 'last_name': <__proxy__...(truncated)) at ./Include/object.h:500
#33 delitem_common (mp=mp@entry=0x808925f40, hash=hash@entry=2156362729, ix=<optimized out>, old_value={'connections': {'default': <MyProjectSearchEngine(using='default', options={'ENGINE': 'myproject.base.haystack.MyProjectSearchEngine', 'PATH': '/home/project/build/backend/xapian_index/default', 'myproject-lockfile': '/home/project/build/backend/xapian_index/default.lockfile'}, queries=[], _index=<UnifiedIndex(_indexes={<ModelBase(__module__='myproject.booking.models', __doc__='Recipient of an `Order`.', autocomplete_fields={'first_name': 'given-name', 'last_name': 'family-name', 'salutation': 'honorific-title', 'title': 'honorific-suffix', 'company': 'organization', 'street_and_nr': 'address-line1', 'address_ext': 'address-line2', 'postal_code': 'postal-code', 'place': 'address-level2', 'state': 'address-level1', 'country': 'country', 'phone': 'tel', 'fax': 'fax', 'email': 'email', 'birth_date': 'bday'}, placeholder_fields={'first_name': <__proxy__(_proxy____args=('John',), _proxy____kw={}) at remote 0x80cd54220>, 'last_name': <__proxy__...(truncated)) at Objects/dictobject.c:1667
#34 0x0000000000450bc8 in _PyDict_DelItem_KnownHash (op={}, key=<weakref.ReferenceType at remote 0x808cf26b0>, hash=2156362729) at Objects/dictobject.c:1719
#35 0x000000000054ee45 in _localdummy_destroyed (localweakref=<optimized out>, dummyweakref=<weakref.ReferenceType at remote 0x808cf26b0>) at ./Modules/_threadmodule.c:1057
#36 0x0000000000591459 in cfunction_vectorcall_O (func=<built-in method _localdummy_destroyed of weakref.ReferenceType object at remote 0x808cf18f0>, args=0x7fffffffd3c8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/methodobject.c:516
#37 0x00000000004aef59 in _PyObject_VectorcallTstate (nargsf=9223372036854775809, kwnames=0x0, args=0x7fffffffd3c8, callable=<built-in method _localdummy_destroyed of weakref.ReferenceType object at remote 0x808cf18f0>, tstate=0x801249280) at ./Include/cpython/abstract.h:114
#38 PyObject_CallOneArg (arg=<optimized out>, func=<built-in method _localdummy_destroyed of weakref.ReferenceType object at remote 0x808cf18f0>) at ./Include/cpython/abstract.h:184
#39 handle_callback (ref=<optimized out>, callback=<built-in method _localdummy_destroyed of weakref.ReferenceType object at remote 0x808cf18f0>) at Objects/weakrefobject.c:952
#40 0x00000000004b25b3 in PyObject_ClearWeakRefs (object=object@entry=<_thread._localdummy at remote 0x80877be90>) at Objects/weakrefobject.c:998
#41 0x000000000054ea76 in localdummy_dealloc (self=0x80877be90) at ./Modules/_threadmodule.c:676
#42 0x000000000044caad in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#43 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
#44 free_keys_object (keys=0x80ef0f180) at Objects/dictobject.c:628
#45 0x000000000044d430 in dictkeys_decref (dk=0x80ef0f180) at Objects/dictobject.c:337
#46 dict_dealloc (mp=0x802cb48c0) at Objects/dictobject.c:2076
#47 0x00000000004f0365 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#48 PyThreadState_Clear (tstate=tstate@entry=0x801249280) at Python/pystate.c:848
#49 0x00000000004f0420 in interpreter_clear (interp=0x80124e7c0, tstate=0x801249280) at Python/pystate.c:298
#50 0x00000000004f079c in _PyInterpreterState_Clear (tstate=<optimized out>) at Python/pystate.c:361
#51 0x00000000004ea9a0 in finalize_interp_clear (tstate=tstate@entry=0x801249280) at Python/pylifecycle.c:1665
#52 0x00000000004eb655 in Py_FinalizeEx () at Python/pylifecycle.c:1843
#53 0x00000000004eed4a in Py_FinalizeEx () at Python/pylifecycle.c:1708
#54 Py_Exit (sts=1) at Python/pylifecycle.c:2858
#55 0x00000000004f3575 in handle_system_exit () at Python/pythonrun.c:775
#56 _PyErr_PrintEx (tstate=0x801249280, set_sys_last_vars=set_sys_last_vars@entry=1) at Python/pythonrun.c:785
#57 0x00000000004f3b16 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:880
#58 PyErr_Print () at Python/pythonrun.c:886
#59 _PyRun_SimpleFileObject (fp=fp@entry=0x800bbb6c0, filename=filename@entry='/home/project/build/backend/manage.py', closeit=closeit@entry=1, flags=flags@entry=0x7fffffffd718) at Python/pythonrun.c:462
#60 0x00000000004f3d95 in _PyRun_AnyFileObject (fp=0x800bbb6c0, filename=filename@entry='/home/project/build/backend/manage.py', closeit=closeit@entry=1, flags=flags@entry=0x7fffffffd718) at Python/pythonrun.c:90
#61 0x000000000042644b in pymain_run_file_obj (skip_source_first_line=0, filename='/home/project/build/backend/manage.py', program_name='python') at Modules/main.c:353
#62 pymain_run_file (config=0x80124eb70) at Modules/main.c:372
#63 pymain_run_python (exitcode=0x7fffffffd714) at Modules/main.c:587
#64 Py_RunMain () at Modules/main.c:666
#65 0x00000000004269e3 in pymain_main (args=0x7fffffffd850) at Modules/main.c:696
#66 Py_BytesMain (argc=argc@entry=9, argv=argv@entry=0x7fffffffd910) at Modules/main.c:720
#67 0x000000000041c065 in main (argc=argc@entry=9, argv=argv@entry=0x7fffffffd910) at ./Programs/python.c:15
#68 0x00000000004253e0 in _start (ap=<optimized out>, cleanup=<optimized out>) at /usr/src/lib/csu/amd64/crt1_c.c:75

Hope this can help somebody.

karolyi commented 1 year ago

@sergei-maertens I can reproduce the bug with your test on my CI, FreeBSD jail with python3.10.10, and not with python3.10.9. Python is clearly the culprit.

Haven't tried on my Linux box, but probably it would result in the same phenomenon.

karolyi commented 1 year ago

Compiling python from the main branch gives a python that doesn't have this issue.

I tried to revert to https://github.com/python/cpython/commit/762745a124cbc297cf2fe6f3ec9ca1840bb2e873 in there to see if the repo state there results in a Python 3.12.0a4+ (tags/v3.12.0a4-4-g762745a124:762745a124) that hangs, but it doesn't.

carljm commented 1 year ago

@karolyi if you try compiling Python from a different commit, say the same commit that was released as 3.10.10, can you repro the issue?

karolyi commented 1 year ago

@carljm the tag v3.10.10 refers a commit in the 3.10 branch, and the entire 3.10 branch is broken now in terms of this bug as of the referred merge.

Reverting the referred (cherry-picked) commit in the 3.10 branch fixed the issue, so the culprit is indeed https://github.com/python/cpython/commit/762745a124cbc297cf2fe6f3ec9ca1840bb2e873, however checking out to the main branch version at this commit results in a working 3.12.

Hence, git bisect isn't helping here.

Something is in there in the main branch that fixes this issue. Probably architectural changes that are way harder for me to figure out.

carljm commented 1 year ago

The reason I ask is because on Ubuntu Linux, as mentioned above, I couldn't reproduce the issue even on 3.10 branch, when compiling Python myself. So are you saying that when compiling Python yourself, you are able to repro the issue from 3.10 branch, but not from main? I didn't see that clearly mentioned above. (When you say you reproed with 3.10.10 and not 3.10.9, it's not clear where those Pythons came from.)

karolyi commented 1 year ago

Interesting. I'm testing on my FreeBSD box now, since I don't use docker and I don't want to pollute my linux box (Manjaro) with a self compiled python. I might just try it with an actual Linux VM.

Also the production server which is an ubuntu 20.04 (focal fossa) using http://ppa.launchpad.net/deadsnakes/ppa/ubuntu/, suffers from this issue on python 3.10.10 as well.

When you say you reproed with 3.10.10 and not 3.10.9, it's not clear where those Pythons came from.

Those were the official tarballs from the python.org site, until I started compiling from the github git repo.

kumaraditya303 commented 1 year ago

Working on a fix...

kumaraditya303 commented 1 year ago

Actually this bug is very old from git blame, any head lock acquisition at shutdown can deadlock while tstates are being cleared, https://github.com/python/cpython/pull/102222 fixes this. https://github.com/python/cpython/pull/100922 just made it happens to trigger this.

kumaraditya303 commented 1 year ago

Marking release blocker, cc @pablogsal @ambv

karolyi commented 1 year ago

@kumaraditya303

I'm bisecting the main branch as we speak, cherry-picking https://github.com/python/cpython/commit/762745a124cbc297cf2fe6f3ec9ca1840bb2e873 on top of each bisect operation.

There is a commit in main that fixes this issue, and I'm eager to find out which one it is.

kumaraditya303 commented 1 year ago

There is a commit in main that fixes this issue, and I'm eager to find out which one it is.

Whatever that commit does, the problem exists in main too, it must be that is just delaying calling the finalizer so it doesn't deadlocks.

karolyi commented 1 year ago

Are you able to reproduce the problem in main as well? For me on FreeBSD, the main branch doesn't hang.

kumaraditya303 commented 1 year ago

@karolyi Can you try applying the following patch on 3.10 and test it? It is https://github.com/python/cpython/pull/102222 backported to 3.10 head.

diff --git a/Python/pystate.c b/Python/pystate.c
index df98eb11bb..c7a6af5da8 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -293,11 +293,19 @@ interpreter_clear(PyInterpreterState *interp, PyThreadState *tstate)
         _PyErr_Clear(tstate);
     }

+    // Clear the current/main thread state last.
     HEAD_LOCK(runtime);
-    for (PyThreadState *p = interp->tstate_head; p != NULL; p = p->next) {
+    PyThreadState *p = interp->tstate_head;
+    HEAD_UNLOCK(runtime);
+    while (p != NULL) {
+        // See https://github.com/python/cpython/issues/102126
+        // Must be called without HEAD_LOCK held as it can deadlock
+        // if any finalizer tries to acquire that lock.
         PyThreadState_Clear(p);
+        HEAD_LOCK(runtime);
+        p = p->next;
+        HEAD_UNLOCK(runtime);
     }
-    HEAD_UNLOCK(runtime);

     Py_CLEAR(interp->audit_hooks);
karolyi commented 1 year ago

Yes, will do, hold on.

In the meantime, you can look into https://github.com/python/cpython/commit/c314198fadd3dc79ee15f375a409163d8fb33586, it's the commit with which the main branch doesn't hang, reverting it causes the main branch to hang.

kumaraditya303 commented 1 year ago

In the meantime, you can look into https://github.com/python/cpython/commit/c314198fadd3dc79ee15f375a409163d8fb33586, it's the commit with which the main branch doesn't hang, reverting it causes the main branch to hang.

This is a GC timing issue which is unpredictable by nature. Any unrelated can change the way objects are deallocated or when finalizers are called and possibly delay it.

karolyi commented 1 year ago

There is an error while compiling with your patch:

gcc -pthread -c -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall  -O2 -pipe  -std=c99 -Wextra -Wno-unused-result -Wn
o-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal 
 -I. -I./Include    -DPy_BUILD_CORE -o Python/pystate.o Python/pystate.c
Python/pystate.c: In function 'interpreter_clear':
Python/pystate.c:298:32: error: 'PyInterpreterState' {aka 'struct _is'} has no member named 'threads'; did you mean 'num_threads'?
  298 |     PyThreadState *p = interp->threads.head;
      |                                ^~~~~~~
      |                                num_threads
*** Error code 1

Stop.
karolyi commented 1 year ago

If you meant tstate_head there, I've corrected it to that, compiled it and it doesn't hang now.

kumaraditya303 commented 1 year ago

Yeah, sorry for the typo. Thanks for checking.

karolyi commented 1 year ago

I've also checked my use case with django-haystack and django, the process hanging disappeared there as well. This seems to be the right solution.

karolyi commented 1 year ago

Not sure if this the right place to ask, but can we get a hotfix version until 3.10.11 arrives?

pablogsal commented 1 year ago

Not sure if this the right place to ask, but can we get a hotfix version until 3.10.11 arrives?

Thanks for checking with us, but unfortunately this doesn't qualify for a hotfix release of the interpreter. The release process is very involved and unfortunately we cannot do it as much as we would like. This fix will be available in the next regular bugfix release of the interpreter, which is in a month or two.