python / cpython

The Python programming language
https://www.python.org
Other
62.78k stars 30.08k forks source link

3.5.0 regression - Fatal Python error: Cannot recover from stack overflow #69409

Closed 311eef77-bab8-48e0-b836-68a7a1f4c797 closed 2 years ago

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago
BPO 25222
Nosy @vstinner, @vadmium, @serhiy-storchaka, @ajdavis, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core'] title = '3.5.0 regression - Fatal Python error: Cannot recover from stack overflow' updated_at = user = 'https://bugs.python.org/behackett' ``` bugs.python.org fields: ```python activity = actor = 'iritkatriel' assignee = 'none' closed = True closed_date = closer = 'iritkatriel' components = ['Interpreter Core'] creation = creator = 'behackett' dependencies = [] files = [] hgrepos = [] issue_num = 25222 keywords = [] message_count = 14.0 messages = ['251453', '251461', '251466', '251469', '251476', '251493', '251497', '251499', '251502', '251539', '251540', '251541', '251555', '404347'] nosy_count = 6.0 nosy_names = ['vstinner', 'martin.panter', 'serhiy.storchaka', 'emptysquare', 'behackett', 'iritkatriel'] pr_nums = [] priority = 'normal' resolution = 'out of date' stage = 'resolved' status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue25222' versions = ['Python 3.5'] ```

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

While running PyMongo's test suite against python 3.5.0 the interpreter inconsistently aborts when we test encoding a recursive data structure like:

evil = {}
evil['evil'] = evil

The test that triggers this was added to test the use of Py_EnterRecursiveCall in PyMongo's C extensions and passes without issues on all previous CPython releases back to 2.4.x (the oldest version PyMongo supports). The interesting thing about this abort is that it only occurs when testing PyMongo *without* its C extensions. The stacktrace looks like this:

test_bad_encode (test_bson.TestBSON) ... Exception ignored in: \<bound method PeriodicExecutor.close of \<pymongo.periodic _executor.PeriodicExecutor object at 0x00000000046AB0F0>> RecursionError: maximum recursion depth exceeded while calling a Python object Fatal Python error: Cannot recover from stack overflow.

Thread 0x00000b6c (most recent call first): File "C:\10gen\mongo-python-driver\pymongo\periodic_executor.py", line 105 in _run File "C:\Python35\lib\threading.py", line 871 in run File "C:\Python35\lib\threading.py", line 923 in _bootstrap_inner File "C:\Python35\lib\threading.py", line 891 in _bootstrap

Thread 0x00000690 (most recent call first): File "C:\Python35\lib\threading.py", line 297 in wait File "C:\10gen\mongo-python-driver\pymongo\thread_util.py", line 199 in wait File "C:\10gen\mongo-python-driver\pymongo\periodic_executor.py", line 110 in _run File "C:\Python35\lib\threading.py", line 871 in run File "C:\Python35\lib\threading.py", line 923 in _bootstrap_inner File "C:\Python35\lib\threading.py", line 891 in _bootstrap

Thread 0x00000900 (most recent call first): File "C:\Python35\lib\threading.py", line 297 in wait File "C:\10gen\mongo-python-driver\pymongo\thread_util.py", line 199 in wait File "C:\10gen\mongo-python-driver\pymongo\periodic_executor.py", line 110 in _run File "C:\Python35\lib\threading.py", line 871 in run File "C:\Python35\lib\threading.py", line 923 in _bootstrap_inner File "C:\Python35\lib\threading.py", line 891 in _bootstrap

Current thread 0x00000a20 (most recent call first): File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 462 in _encode_list File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson

My current working theory is if RecursionError is raised during a thread switch the recursion depth is deemed too great to recover from. Maybe the GIL is no longer being held during some call to _Py_CheckRecursiveCall? This failure looks suspiciously similar to bpo-22971 given that, in that ticket, the failure occurs while or immediately after running test_thread.

The failure has been observed on 64 and 32bit Windows 7 VMs as well as a 64bit Amazon Linux instance. Sadly, I haven't yet been able to devise a reproduction that doesn't involve running PyMongo's test suite.

vstinner commented 9 years ago

I'm not sure that I understood the bug.

You are testing a function which enters an unlimited loop. You expect to get a RecursionError, but Python does crash.

Well, Python has no perfect protection again stack overflow. It's only best effect. You should change sys.setrecursionlimit() to a lower limit to try to limit the risk of a crash.

It's hard to predict how the stack will be used.

FYI the error message comes from Python/ceval.c:

    if (tstate->overflowed) {
        if (tstate->recursion_depth > recursion_limit + 50) {
            /* Overflowing while handling an overflow. Give up. */
            Py_FatalError("Cannot recover from stack overflow.");
        }
        return 0;
    }

I suggest you to use sys.setrecursionlimit() in your test, or remove the test.

Maybe there is a real bug, but if it's the case, it will probably be hard to investigate it. You need to dig into gdb.

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

Well, Python has no perfect protection again stack overflow. It's only best effect.

That's interesting. I thought there was a stronger contract, and it appeared that way in all previous cpython releases back to 2.4. Again, this failure is new with 3.5.0.

You should change sys.setrecursionlimit() to a lower limit to try to limit the risk of a crash.

That's a great suggestion and it appears to work around the issue, thanks.

Feel free to close this won't fix, but given that this appears to be the second report of this issue for 3.5 (see bpo-22971) someone who understands interpreter internals may want to look into the problem a bit more first.

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

I used sys.setrecursionlimit(250) - the default appears to be 1000 on all my test machines - and that reduced the occurrence of the abort but didn't completely solve the problem. There must be something more going on here.

vadmium commented 9 years ago

I saw this error recently on some 3.4 builtbots:

http://buildbot.python.org/all/builders/AMD64%20Windows10%203.4/builds/74/steps/test/logs/stdio (failed last 4/5 builds)

[306/390] test_json Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00000fd8 (most recent call first): File "D:\buildarea\3.4.bolen-windows10\build\lib\json\encoder.py", line 250 in iterencode File "D:\buildarea\3.4.bolen-windows10\build\lib\json\encoder.py", line 192 in encode File "D:\buildarea\3.4.bolen-windows10\build\lib\test\test_json\test_recursion.py", line 96 in test_endless_recursion

http://buildbot.python.org/all/builders/x86%20XP-4%203.4/builds/1224/steps/test/logs/stdio (only failed 1/5 builds, stack trace essentially the same)

vstinner commented 9 years ago

Bernie Hackett wrote: "Feel free to close this won't fix, but given that this appears to be the second report of this issue for 3.5 (see bpo-22971) ..."

It's not because you get the same error message that the root cause is the same.

Martin Panter wrote: "I saw this error recently on some 3.4 builtbots: (...) AMD64 Windows10 3.4"

Again, the root cause can be very different. If you want to work on this crash, please open a separated issue for this specific crash on test_json. The code on Windows can be a little bit different, and ceval.c evolved between Python 3.4 and 3.6.

serhiy-storchaka commented 9 years ago

The json code was almost not changed, so it can be related to changes in import machinery or in ceval.c if they consume more stack than previous.

Yet one suspect is Argument Clinic that adds additional wrappers.

vstinner commented 9 years ago

I'm unable to reproduce the issue on Linux. I tried system python 3.4.2 and the development version of Python 3.6. First I tried with no MongoDB server running. Then I tried with a MongoDB server running.

I also ran the pymongo test suite on Windows 8.1 with Python development version (3.6) with no MongoDB server.

With no MongoDB server running, I don't see threads running pymongo/periodic_executor.py.

Any hint to try to reproduce the crash?

vstinner commented 9 years ago

Serhiy Storchaka added the comment:

The json code was almost not changed, so it can be related to changes in import machinery or in ceval.c if they consume more stack than previous.

Yet one suspect is Argument Clinic that adds additional wrappers.

This issue is about the number of Python frames, right? Argument Clinic may use a little bit more bytes on the C stack, but it doesn't create Python frames. Right?

--

Current thread 0x00000a20 (most recent call first): File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 692 in _element_to_bson File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in \<listcomp> File "C:\10gen\mongo-python-driver\bson\init.py", line 433 in _encode_mapping File "C:\10gen\mongo-python-driver\bson\init.py", line 653 in _name_value_to_bson ...

The "Fatal Python error: Cannot recover from stack overflow" occurs when Python already raised a RecursionError (RuntimeError on Python \< 3.5) in the current Python thread, but the recursion limit was reached again, even with 50 more frames than when the RecursionError was raised.

Since a RecursionError exception was raised, I would expect the program to crash in an except block, but it's not the case.

_name_value_to_bson:653: I don't have exactly the same version (I'm reading pymongo 3.0.3 source), but I only see a try/except KeyError here. No bare "except: pass" or "except Exception: ...".

_element_to_bson:692: Ditto, only a try/except AttributeError.

--

@Bernie: Would you be able to get a C backtrace using gdb on such crash?

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

File "D:\buildarea\3.4.bolen-windows10\build\lib\test\test_json\test_recursion.py", line 96 in test_endless_recursion

That test (and reason for existence) is almost exactly the same as the PyMongo test that causes abort reported in this ticket. I'm not surprised it is also failing. test_pickle appears to have a similar issue. For test_json see here:

https://hg.python.org/cpython/rev/a21f5af476cb

And PyMongo's equivalent test:

https://github.com/mongodb/mongo-python-driver/blob/4bbe2133a14df716b1dffe8ab7957ed67149b2cd/test/test_bson.py#L538-L543

I'm unable to reproduce the issue on Linux.

The abort is very rare for me on Linux, so much so that I originally thought it only occurred on Windows. I can't reproduce it on a local Linux machine, but it has happened inconsistently in our Jenkins cluster. For example (note that this run includes a commit that uses sys.setrecursionlimit(250) and still aborts):

https://jenkins.mongodb.com/view/Python/job/mongo-python-driver/404/extensions=without-extensions,label=linux64,mongodb_configuration=single_server,mongodb_option=noauth,mongodb_server=26-release,python_language_version=3.5/console

First I tried with no MongoDB server running. Then I tried with a MongoDB server running.

The problem only occurs for me when testing with a MongoDB instance running. Running background threads seem to be a key requirement for reproduction. Note also that this problem does not reproduce when using PyMongo's C extensions, only when using the pure python BSON encoder. You can run the tests like this from a clean git checkout:

python3.5 setup.py --no_ext test

I don't have exactly the same version (I'm reading pymongo 3.0.3 source)

Clone master from github:

https://github.com/mongodb/mongo-python-driver

@Bernie: Would you be able to get a C backtrace using gdb on such crash?

Not likely, since I can't reproduce it under Linux locally. But I *can* reproduce it consistently on a Windows 7 machine that has VS2015. I'll see what I can get out of the Visual Studio debugger and report back.

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

Clone master from github:

You'll also have to git checkout 4bbe2133a14df716b1dffe8ab7957ed67149b2cd to roll back the setrecursionlimit change I added to work around this issue.

Using 100 seems to have permanently vanquished the abort, but there's no way the stack is overflowing with such a low setting and Py_EnterRecursiveCall shouldn't cause python to abort all by itself.

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

Here's a chunk of the call stack from the Visual Studio debugger using the debug build. Py_FatalError seems to be called multiple times:

ucrtbased.dll!72d27f30() Unknown [Frames below may be incorrect and/or missing, no symbols loaded for ucrtbased.dll] [External Code] python35_d.dll!Py_FatalError(const char msg) Line 1375 C
python35_d.dll!_Py_CheckRecursiveCall(const char
where) Line 733 C python35_d.dll!PyObject_Call(_object func, _object arg, _object kw) Line 2144 C python35_d.dll!call_function_tail(_object callable, _object args) Line 2173 C python35_d.dll!callmethod(_object func, const char format, char va, int is_size_t) Line 2242 C python35_d.dll!_PyObject_CallMethodId(_object o, _Py_Identifier name, const char format, ...) Line 2287 C python35_d.dll!flush_std_files() Line 488 C
python35_d.dll!Py_FatalError(const char
msg) Line 1354 C
python35_d.dll!_Py_CheckRecursiveCall(const char where) Line 733 C python35_d.dll!PyObject_Call(_object func, _object arg, _object kw) Line 2144 C python35_d.dll!do_call(_object func, _object pp_stack, int na, int nk) Line 4883 C python35_d.dll!call_function(_object pp_stack, int oparg) Line 4679 C python35_d.dll!PyEval_EvalFrameEx(_frame f, int throwflag) Line 3183 C python35_d.dll!fast_function(_object func, _object pp_stack, int n, int na, int nk) Line 4751 C python35_d.dll!call_function(_object pp_stack, int oparg) Line 4677 C python35_d.dll!PyEval_EvalFrameEx(_frame f, int throwflag) Line 3183 C python35_d.dll!fast_function(_object func, _object pp_stack, int n, int na, int nk) Line 4751 C python35_d.dll!call_function(_object pp_stack, int oparg) Line 4677 C python35_d.dll!PyEval_EvalFrameEx(_frame f, int throwflag) Line 3183 C python35_d.dll!_PyEval_EvalCodeWithName(_object _co, _object globals, _object locals, _object args, int argcount, _object kws, int kwcount, _object defs, int defcount, _object kwdefs, _object closure, _object name, _object qualname) Line 3962 C python35_d.dll!fast_function(_object func, _object * pp_stack, int n, int na, int nk) Line 4760 C

311eef77-bab8-48e0-b836-68a7a1f4c797 commented 9 years ago

On second thought, _Py_CheckRecursiveCall may be being called recursively through Py_FatalError.

iritkatriel commented 2 years ago

There was work on recursion in recent versions and the Fatal Error is less likely now. We will not investigate recursion issues on 3.5, so I will close this and request that you create a new issue if you have problem with a current Python version.