Closed The-Compiler closed 4 months ago
I did a quick check for recent changes to Objects/unicodeobject.c::replace(), but the code has been stable. The last edit was 17b4733f2ff (Serhiy Storchaka, 2020-04-01). That said, a lot of other things have changed in CPython (the memory allocator, method dispatch, and the nest of include files).
It would help a great deal if you could bisect the problem to a particular edit or construct a minimal reproducer that excludes third-party extensions.
I got it down from 15 minutes to 7 minutes runtime, but at this point, even just removing a handful of test files from the run makes it not reproduce anymore, from what I've tried. I don't think a minimal example is realistic at this point given that I can only get it to trigger after running >3000 test cases, but I'll keep trying to reduce it a bit more, and then do a git bisect. It will probably take me a few days, unless I can get things reliable enough for a git bisect run.
Quick status update: After trying to reduce the test files being run one by one, I found a more minimal combination that still reliably triggers the bug:
tox -e py313-pyqt66 -- \
tests/unit/browser/test_browsertab.py \
tests/unit/browser/test_caret.py \
tests/unit/browser/test_downloadview.py \
tests/unit/browser/test_history.py \
tests/unit/browser/test_hints.py \
tests/unit/browser/test_navigate.py \
tests/unit/config/test_configdata.py \
tests/unit/config/test_configtypes.py \
tests/unit/config/test_configinit.py \
tests/unit/config/test_configfiles.py \
tests/unit/mainwindow/test_messageview.py -v -s
That's still almost 2000 tests with a runtime of 4 minutes, though. I'll now proceed to bisecting CPython and hope that helps find the culprit.
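For `git bisect run`, the step script only has to report each commit as good, bad, or untestable through its exit code. The following is a minimal sketch of such a helper, not what was actually used in this thread; the function names and any build or test commands passed in are placeholders.

```python
import subprocess

SKIP = 125  # `git bisect run` treats exit code 125 as "cannot test this commit"


def bisect_exit_code(build_ok: bool, returncode: int) -> int:
    """Map a build/test outcome onto the exit codes `git bisect run` expects."""
    if not build_ok:
        return SKIP  # broken build: skip the commit instead of blaming it
    if returncode == 0:
        return 0     # reproducer survived: commit is good
    return 1         # non-zero exit or death by signal: commit is bad


def run_step(build_cmd: list[str], test_cmd: list[str]) -> int:
    """Build the checked-out revision, run the reproducer, classify the result."""
    build = subprocess.run(build_cmd)
    if build.returncode != 0:
        return bisect_exit_code(False, build.returncode)
    test = subprocess.run(test_cmd)
    # On POSIX, a child killed by a signal (e.g. SIGABRT after a failed
    # assertion) is reported as a negative returncode.
    return bisect_exit_code(True, test.returncode)
```

A wrapper script could end with `sys.exit(run_step(build_cmd, test_cmd))` and be passed to `git bisect run`. Collapsing every failure to exit code 1 matters because `git bisect run` aborts on codes above 127 rather than marking the commit as bad.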
Bisected to 992446dd5bd3fff92ea0f8064fb19eebfe105cef:
Not sure what to make of this. The change looks quite innocent, but I've double-checked it's indeed that commit that causes the memory corruption to happen.
cc @markshannon
I got it to fail under valgrind with PYTHONMALLOC=malloc:
tests/unit/mainwindow/test_messageview.py::test_click_messages[MouseButton.RightButton-0] ./Include/object.h:1030: _Py_NegativeRefcount: Assertion failed: object has negative ref count
Enable tracemalloc to get the memory block allocation traceback
object address : 0x5311b600
object refcount : 0
object type : 0x6599c0
object type name: str
object repr : <refcnt 0 at 0x5311b600>
Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized
TypeError: replace() argument 2 must be str, not posix.DirEntry
Valgrind reports:
I'll play around with -X tracemalloc and maybe clang's sanitizers next, in the hope that they can say something more.
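The memory-debugging settings mentioned here can be collected in one place so the same script is easy to rerun under each of them. This is only a sketch: the helper name `debug_variants` and the labels are made up for illustration, while `PYTHONMALLOC` and `-X tracemalloc=NFRAME` are standard CPython options.

```python
import os
import sys


def debug_variants(script: str) -> dict[str, tuple[list[str], dict[str, str]]]:
    """Return {label: (argv, env)} for a few memory-debugging configurations."""
    base_env = dict(os.environ)
    return {
        # Bypass pymalloc so tools like valgrind or ASan see each allocation.
        "malloc": ([sys.executable, script],
                   {**base_env, "PYTHONMALLOC": "malloc"}),
        # Install CPython's debug hooks on top of the default allocators.
        "debug-hooks": ([sys.executable, script],
                        {**base_env, "PYTHONMALLOC": "debug"}),
        # Record allocation tracebacks, up to 25 frames deep.
        "tracemalloc": ([sys.executable, "-X", "tracemalloc=25", script],
                        base_env),
    }
```

Each entry can then be run with `subprocess.run(argv, env=env)`, which keeps the variants comparable because only one setting changes at a time.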
Similar result with ASan:
tests/unit/mainwindow/test_messageview.py::test_replaced_messages[None-testid-2] =================================================================
==168082==ERROR: AddressSanitizer: heap-use-after-free on address 0x506004080b60 at pc 0x5f101832396d bp 0x7ffcbe3ffd90 sp 0x7ffcbe3ffd88
I've finally been able to reduce this a lot, at least within the qutebrowser project: these two tests trigger the bug as soon as the second one runs for the 1010th time.
import pytest

from qutebrowser.config import configdata, configtypes
from qutebrowser.utils import standarddir


def test_crash_1(qapp):
    standarddir.init(None)
    configdata.init()
    configtypes.FontBase.set_defaults(None, '10pt')


@pytest.mark.parametrize("i", range(1010))
def test_crash_2(config_stub, i):
    configtypes.Font().to_py("10pt default_family")
I've not yet been able to reproduce this outside of pytest (or qutebrowser), as it still seems to be pretty sensitive to what's going on before the bug gets triggered (probably because it is GC-related?). But at this point it looks like I should be able to cook up a minimal-ish example with a couple more hours of trial and error.
Aaaand I arrived at a minimal example:
from PyQt6.QtCore import QSysInfo


def maybe_crash():
    class StringHolder:
        value = None

        @classmethod
        def set_value(cls):
            # needs to be set here, setting from outside doesn't trigger the crash.
            # anything that returns a QString from Qt/C++
            cls.value = QSysInfo.productType()

    class StringHolderSub(StringHolder):
        # needs to be subclass, using StringHolder directly to access .value
        # doesn't trigger the crash.
        pass

    for _ in range(1010):  # triggers exactly after 1010 times.
        StringHolder.set_value()
        StringHolderSub.value


if __name__ == "__main__":
    for _ in range(5):
        # crash is not 100% reproducible with the minimal reproducer
        maybe_crash()
Crashes reliably for me when using a --with-address-sanitizer build or with PYTHONMALLOC=malloc. With the default allocator, it needs 2-3 attempts, but the for loop will take care of that.
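Since the crash does not fire on every attempt, it can help to measure how often a given setup fails by running the reproducer in fresh processes. A small sketch, assuming any non-zero exit status (including death by signal) counts as a crash; `count_crashes` is a hypothetical helper name.

```python
import subprocess
import sys


def count_crashes(argv: list[str], attempts: int = 5) -> int:
    """Run `argv` in a fresh process `attempts` times; count abnormal exits."""
    crashes = 0
    for _ in range(attempts):
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            crashes += 1
    return crashes
```

For example, `count_crashes([sys.executable, "repro.py"], attempts=10)` would report how many of ten runs failed, where `repro.py` stands in for the reproducer file.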
@rhettinger @markshannon Hope that works? It still requires PyQt6, I have not tested yet if the string can also be from another third-party library. A normal Python string won't trigger it.
Sorry for the notification-heavy notes to myself here - since this was a longer process, I figured it'd be better to have my notes here than just for myself.
The PyQt code that triggers the crash is the qpycore_PyObject_FromQString() function. Based on that, I think the following (similar) code would also trigger it:

PyObject *obj;
int kind;
void *data;

obj = PyUnicode_New(1, 127);
kind = PyUnicode_KIND(obj);
data = PyUnicode_DATA(obj);
PyUnicode_WRITE(kind, data, 0, (Py_UCS4)'A');
If the length is initialised to 0 (rather than 1) and there is no call to PyUnicode_WRITE(), then there is no crash.
Is the string you are creating escaping to other code before it is fully initialized? If the hash and data contradict, strange things can happen.
Can you link to the source of qpycore_PyObject_FromQString()? That might give me a clue.
I ran a git bisect and the regression was introduced by commit 992446dd5bd3fff92ea0f8064fb19eebfe105cef.
commit 992446dd5bd3fff92ea0f8064fb19eebfe105cef (HEAD)
Author: Mark Shannon <mark@hotpy.org>
Date: Mon Feb 5 16:20:54 2024 +0000
GH-113462: Limit the number of versions that a single class can use. (GH-114900)
Include/cpython/object.h | 1 +
Lib/test/test_type_cache.py | 13 +++++++++++++
Misc/NEWS.d/next/Core and Builtins/2024-02-02-05-27-48.gh-issue-113462.VMml8q.rst | 2 ++
Objects/typeobject.c | 7 ++++++-
4 files changed, 22 insertions(+), 1 deletion(-)
create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-02-02-05-27-48.gh-issue-113462.VMml8q.rst
I can reproduce the bug in a reliable way without PyQt with a debug build of Python:
def maybe_crash():
    class StringHolder:
        value = None

        @classmethod
        def set_value(cls):
            cls.value = b'abc'.decode()

    class StringHolderSub(StringHolder):
        pass

    for _ in range(1010):
        StringHolder.set_value()
        StringHolderSub.value


if __name__ == "__main__":
    for _ in range(5):
        maybe_crash()
Example of output:
$ ~/python/main/python bug.py
Python/generated_cases.c.h:5040: _Py_NegativeRefcount: Assertion failed: object has negative ref count
<object at 0x7f1b8f86fa60 is freed>
Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized
Current thread 0x00007f1b9d5f0740 (most recent call first):
File "/home/vstinner/python/3.13/bug.py", line 14 in maybe_crash
File "/home/vstinner/python/3.13/bug.py", line 18 in <module>
Abandon (core dumped)
If I revert this change on the main branch, I can no longer reproduce the bug:
diff --git a/Include/cpython/object.h b/Include/cpython/object.h
index 0ab94e5e2a..0bfc20ac9c 100644
--- a/Include/cpython/object.h
+++ b/Include/cpython/object.h
@@ -229,7 +229,6 @@ struct _typeobject {
     /* bitset of which type-watchers care about this type */
     unsigned char tp_watched;
-    uint16_t tp_versions_used;
 };
 /* This struct is used by the specializer
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 958f42430c..333ddb811c 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1214,8 +1214,6 @@ _PyType_GetVersionForCurrentState(PyTypeObject *tp)
-#define MAX_VERSIONS_PER_CLASS 1000
-
 static int
 assign_version_tag(PyInterpreterState *interp, PyTypeObject *type)
 {
@@ -1232,10 +1230,6 @@ assign_version_tag(PyInterpreterState *interp, PyTypeObject *type)
     if (!_PyType_HasFeature(type, Py_TPFLAGS_READY)) {
         return 0;
     }
-    if (type->tp_versions_used >= MAX_VERSIONS_PER_CLASS) {
-        return 0;
-    }
-    type->tp_versions_used++;
     if (type->tp_flags & Py_TPFLAGS_IMMUTABLETYPE) {
         /* static types */
         if (NEXT_GLOBAL_VERSION_TAG > _Py_MAX_GLOBAL_TYPE_VERSION_TAG) {
Thanks @vstinner, that's really helpful.
We were failing to maintain the invariant that superclass versions must be updated before subclass versions. https://github.com/python/cpython/commit/992446dd5bd3fff92ea0f8064fb19eebfe105cef exposed the bug.
It should be possible to cause the same failure with https://github.com/python/cpython/commit/992446dd5bd3fff92ea0f8064fb19eebfe105cef reverted by changing the 1010 in the reproducer to ~4 billion.
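To see why a loop count just above 1000 matters, here is a toy Python model of the per-class cap shown in the diff earlier in this thread. It mirrors only the `tp_versions_used` counter; the early return for an already-valid tag and the reset-on-modification step are assumptions about the surrounding CPython logic, and the model does not attempt to reproduce the superclass/subclass ordering bug itself.

```python
MAX_VERSIONS_PER_CLASS = 1000  # value taken from the diff in this thread


class ToyType:
    """Tracks only the two fields relevant to the per-class version cap."""

    def __init__(self) -> None:
        self.version_tag = 0    # 0 means "no valid version"
        self.versions_used = 0


_next_version = 1  # stand-in for the interpreter-wide version counter


def assign_version_tag(tp: ToyType) -> bool:
    """Give `tp` a fresh version unless it has exhausted its budget."""
    global _next_version
    if tp.version_tag != 0:
        return True             # assumed: an existing valid tag is reused
    if tp.versions_used >= MAX_VERSIONS_PER_CLASS:
        return False            # cap reached: no further versions for this class
    tp.versions_used += 1
    tp.version_tag = _next_version
    _next_version += 1
    return True


def modified(tp: ToyType) -> None:
    """Assumed: assigning a class attribute invalidates the current version."""
    tp.version_tag = 0


holder = ToyType()
successes = 0
for _ in range(1010):           # same loop count as the reproducer
    modified(holder)            # models `cls.value = ...`
    successes += assign_version_tag(holder)  # models a lookup wanting a version
print(successes)                # 1000
```

In this model the last ten iterations leave the class without a valid version, which is consistent with the reproducer only failing once the loop runs past the 1000-version limit.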
Would it make sense to revert the change for now (in 3.13 and main branches), and consider a more long term approach in Python 3.14?
@The-Compiler can you confirm that this is fixed for you on both the main and 3.13 branches?
> Would it make sense to revert the change for now (in 3.13 and main branches), and consider a more long term approach in Python 3.14?
It's fixed now (pending confirmation)
I confirm that I can no longer reproduce the https://github.com/python/cpython/issues/119462#issuecomment-2135613367 crash on the 3.13 development branch. I'm closing the issue.
The main branch seems to be failing for me in ways that look unrelated (I will dig into those at a later point), and 3.13 is indeed now working fine for me. Thanks @markshannon and @vstinner! :+1:
Crash report
What happened?
I'm trying to run the qutebrowser testsuite with Python 3.13, and am running into an issue where a test reproducibly fails (usually by crashing the interpreter), but only when I run the entire testsuite (not when run in isolation, or even just the tests in the same subfolder).
Given those circumstances, it seems tricky to get to a minimal example. I thought I'd open this issue in the hope of distilling things down further, and arriving at such an example. In the meantime, the best reproduction steps I can come up with are:
A few tests will fail with a --with-pydebug build due to timeouts; those can be ignored. After a while (~13 minutes with --with-pydebug under gdb on my system), one of the tests in tests/unit/mainwindow/test_messageview.py will fail, usually due to a failing assertion because PyUnicode_KIND returned an invalid value.
Failing Python code:
stacktrace:
, globals=, locals=Python Exception : There is no member named ready.
) at Python/ceval.c:598
#67 0x00007211a3c51950 in builtin_exec_impl (module=, closure=, locals=Python Exception : There is no member named ready.
, globals=Python Exception : There is no member named ready.
, source=
) at Python/bltinmodule.c:1145
#68 builtin_exec (module=, args=, nargs=, kwnames=) at Python/clinic/bltinmodule.c.h:556
#69 0x00007211a395a4ea in cfunction_vectorcall_FASTCALL_KEYWORDS (func=, args=0x7211a414a180, nargsf=, kwnames=0x0) at Objects/methodobject.c:441
#70 0x00007211a39fcce9 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775810, args=0x7211a414a180, callable=, tstate=0x7211a3ecb1f0 <_PyRuntime+294032>)
at ./Include/internal/pycore_call.h:168
#71 PyObject_Vectorcall (callable=, args=0x7211a414a180, nargsf=9223372036854775810, kwnames=0x0) at Objects/call.c:327
#72 0x00007211a3889b59 in _PyEval_EvalFrameDefault (tstate=0x7211a3ecb1f0 <_PyRuntime+294032>, frame=0x7211a414a0d8, throwflag=6) at Python/generated_cases.c.h:813
#73 0x00007211a3c36c04 in PyObject_Call (kwargs=0x0, args=Python Exception : There is no member named ready.
, callable=) at Objects/call.c:373
#74 pymain_run_module (modname=, set_argv0=set_argv0@entry=1) at Modules/main.c:297
#75 0x00007211a3c498d2 in pymain_run_python (exitcode=0x7ffecd272a68) at Modules/main.c:633
#76 Py_RunMain () at Modules/main.c:718
#77 0x00007211a3f72cd0 in ??? () at /usr/lib/libc.so.6
#78 0x00007211a3f72d8a in __libc_start_main () at /usr/lib/libc.so.6
#79 0x00005e3a1e856055 in _start ()
```
Sometimes I've also seen a MemoryError on the line calling str.replace, or an ominous:
with the following Python stack (note that jinja is involved, which might or might not be a trigger?):
which leads me to the conclusion that there must be some sort of memory corruption going on there.
On one run, I've also seen a GC-related crash, which I'm not sure is related:
Python stack:
C stack:
, globals=, locals=Python Exception : There is no member named ready.
)
at Python/ceval.c:598
#48 0x00007f311c651950 in builtin_exec_impl (module=, closure=, locals=Python Exception : There is no member named ready.
, globals=Python Exception : There is no member named ready.
, source=
) at Python/bltinmodule.c:1145
#49 builtin_exec (module=, args=, nargs=, kwnames=)
at Python/clinic/bltinmodule.c.h:556
#50 0x00007f311c35a4ea in cfunction_vectorcall_FASTCALL_KEYWORDS
(func=, args=0x7f311cb33180, nargsf=, kwnames=0x0) at Objects/methodobject.c:441
#51 0x00007f311c3fcce9 in _PyObject_VectorcallTstate
(kwnames=0x0, nargsf=9223372036854775810, args=0x7f311cb33180, callable=, tstate=0x7f311c8cb1f0 <_PyRuntime+294032>) at ./Include/internal/pycore_call.h:168
#52 PyObject_Vectorcall
(callable=, args=0x7f311cb33180, nargsf=9223372036854775810, kwnames=0x0) at Objects/call.c:327
#53 0x00007f311c289b59 in _PyEval_EvalFrameDefault
(tstate=0x7f311c8cb1f0 <_PyRuntime+294032>, frame=0x7f311cb330d8, throwflag=6) at Python/generated_cases.c.h:813
#54 0x00007f311c636c04 in PyObject_Call (kwargs=0x0, args=Python Exception : There is no member named ready.
, callable=)
at Objects/call.c:373
#55 pymain_run_module (modname=, set_argv0=set_argv0@entry=1) at Modules/main.c:297
#56 0x00007f311c6498d2 in pymain_run_python (exitcode=0x7ffc5a236bd8) at Modules/main.c:633
#57 Py_RunMain () at Modules/main.c:718
#58 0x00007f311c043cd0 in ??? () at /usr/lib/libc.so.6
#59 0x00007f311c043d8a in __libc_start_main () at /usr/lib/libc.so.6
#60 0x000064c810b08055 in _start ()
```
I'm lost here on how to best debug this further. My best bet would be to try and at least get an example that I can run more quickly, and then try and bisect CPython in order to find the offending change. If there are any other guesses or approaches to debug what could be going on here, I'd be happy to dig in further.
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.13.0b1 (main, May 23 2024, 09:21:12) [GCC 13.2.1 20240417]