python / cpython

The Python programming language
https://www.python.org

Disabling the uop optimizer leads to crashes #127083

Open brandtbucher opened 4 days ago

brandtbucher commented 4 days ago

Crash report

When running:

PYTHON_UOPS_OPTIMIZE=0 PYTHON_LLTRACE=1 ./python -m test test_multiprocessing_fork.test_processes -m test_free_from_gc

On a debug build, we get an assertion failure:

Using random seed: 2033438694
0:00:00 load avg: 1.57 Run 1 test sequentially in a single process
0:00:00 load avg: 1.57 [1/1] test_multiprocessing_fork.test_processes
Created a proto-trace for _path_join (<frozen importlib._bootstrap_external>:131) at byte offset 110 -- length 29
Created a proto-trace for _path_join (<frozen importlib._bootstrap_external>:131) at byte offset 116 -- length 13
Created a proto-trace for Heap._free_pending_blocks (/home/brandtbucher/cpython/Lib/multiprocessing/heap.py:258) at byte offset 126 -- length 482
Created a proto-trace for _TestHeap.test_free_from_gc (/home/brandtbucher/cpython/Lib/test/_test_multiprocessing.py:4114) at byte offset 486 -- length 27
Created a proto-trace for BufferWrapper.__init__ (/home/brandtbucher/cpython/Lib/multiprocessing/heap.py:326) at byte offset 0 -- length 516
python: Python/optimizer.c:1277: uop_optimize: Assertion `length <= UOP_MAX_TRACE_LENGTH' failed.
Fatal Python error: Aborted

Current thread 0x00007f6a0dc56740 (most recent call first):
  File "/home/brandtbucher/cpython/Lib/multiprocessing/heap.py", line 326 in __init__
  File "__init__", line ??? in __init__
  File "/home/brandtbucher/cpython/Lib/test/_test_multiprocessing.py", line 4130 in test_free_from_gc
  File "/home/brandtbucher/cpython/Lib/unittest/case.py", line 606 in _callTestMethod
  File "/home/brandtbucher/cpython/Lib/unittest/case.py", line 660 in run
Fatal Python error: Segmentation fault

Segmentation fault (core dumped)

On a non-debug build, we get a buffer overflow:

Using random seed: 1659198442
0:00:00 load avg: 13.83 Run 1 test sequentially in a single process
0:00:00 load avg: 13.83 [1/1] test_multiprocessing_fork.test_processes
*** stack smashing detected ***: terminated
Fatal Python error: Aborted

Current thread 0x00007f11f9a27740 (most recent call first):
  File "/home/brandtbucher/cpython/Lib/multiprocessing/heap.py", line 326 in __init__
  File "__init__", line ??? in __init__
  File "/home/brandtbucher/cpython/Lib/test/_test_multiprocessing.py", line 4130 in test_free_from_gc
  File "/home/brandtbucher/cpython/Lib/unittest/case.py", line 606 in _callTestMethod
  File "/home/brandtbucher/cpython/Lib/unittest/case.py", line 660 in run
  File Fatal Python error: Segmentation fault

Segmentation fault (core dumped)

It looks like the optimizer normally does enough useful work (shrinking the trace) that adding the error and exit stubs doesn't put us over UOP_MAX_TRACE_LENGTH. My hunch is that the real bug is in translate_bytecode_to_trace, where we're not reserving enough space for some instruction(s).
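To make the suspected failure mode concrete, here is a toy model (not CPython's actual code; all names and numbers are illustrative) of a projection loop that reserves a fixed per-instruction estimate for stubs. If the estimate is smaller than the number of stubs actually emitted later, the finished trace overshoots the cap even though every check during projection passed:

```python
MAX_TRACE_LENGTH = 8  # toy stand-in for UOP_MAX_TRACE_LENGTH

def translate(instructions, stub_estimate):
    """Project a trace from (n_uops, n_stubs) pairs, reserving
    `stub_estimate` slots per instruction for its eventual stubs.
    Returns the final length once the real stubs are materialized."""
    length = 0        # uops emitted during projection
    reserved = 0      # stub slots projection *thinks* it needs
    actual_stubs = 0  # stub slots actually emitted afterwards
    for n_uops, n_stubs in instructions:
        # Projection only checks its own (possibly wrong) estimate:
        if length + n_uops + reserved + stub_estimate > MAX_TRACE_LENGTH:
            break
        length += n_uops
        reserved += stub_estimate
        actual_stubs += n_stubs
    return length + actual_stubs

instructions = [(2, 2)] * 4  # each instruction: 2 uops, 2 stubs

# A correct reservation stays within the cap; under-reserving overflows it:
assert translate(instructions, stub_estimate=2) <= MAX_TRACE_LENGTH  # 8
assert translate(instructions, stub_estimate=0) > MAX_TRACE_LENGTH   # 16
```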

CC @Fidget-Spinner and @markshannon.

JeffersGlass commented 3 days ago

For what it's worth, a number of tests trigger this failure. Below is a list of those tests and the final uop that fails to be added, with the message No room for (UOP) (need X, got Y).

Note that a few of the tests don't seem to show this kind of error, but instead underflow the trace stack.

Test                                                  UOp Failure
test.test_multiprocessing_fork.test_processes         LOAD_FAST (81)
test.test_multiprocessing_forkserver.test_processes   [trace stack underflow]
test.test_multiprocessing_spawn.test_processes        _EXIT_TRACE (300)
test_buffer                                           _LOAD_CONST_IMMORTAL (207)
test_complex                                          COMPARE_OP_FLOAT (185)
test_configparser                                     LOAD_ATTR_METHOD_WITH_VALUES (200)
test_datetime                                         STORE_FAST (107)
test_decimal                                          [trace stack underflow]
test_itertools                                        _POP_TOP (29)
test_long                                             BINARY_OP (42)
test_math                                             CALL_KW_BOUND_METHOD (170)
test_plistlib                                         LOAD_FAST (81)
test_pow                                              LOAD_ATTR_METHOD_NO_DICT (199)
test_slice                                            _TO_BOOL_BOOL (220)
test_struct                                           CALL_METHOD_DESCRIPTOR_FAST (175)
test_tarfile                                          _LOAD_CONST (79)
test_venv                                             [trace stack underflow]
test_zipfile                                          POP_JUMP_IF_TRUE (98)
test_zipimport                                        [trace stack underflow]
test_zipimport_support                                CALL_KW_BOUND_METHOD (170)
test_zoneinfo                                         CALL_PY_EXACT_ARGS (180)

JeffersGlass commented 3 days ago

Here's a simpler (if slightly cursed) reproducer:

# foo.py

def foo():
    # Padding to increase number of UOps in trace
    _ = reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(
            reversed(list(reversed(list(reversed(list(reversed(list(reversed(list(         
                [0]
            ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
            ))))))))))))))))))))))))))))))))))))))))

for _ in range(4096):  # Minimum value
    foo()

./configure --enable-experimental-jit --with-pydebug
make
PYTHON_UOPS_OPTIMIZE=0 PYTHON_LLTRACE=1 ./python foo.py
Created a proto-trace for <module> (/home/jglass/Documents/cpython/foo.py:1) at byte offset 44 -- length 590
python: Python/optimizer.c:1287: uop_optimize: Assertion `length <= UOP_MAX_TRACE_LENGTH' failed.
Aborted (core dumped)
brandtbucher commented 3 days ago

Thanks @JeffersGlass!

I talked with @markshannon, and we think it's probably worth reworking translate_bytecode_to_trace to be less error-prone. The problem is that we need to reserve enough space for the next bytecode instruction's uop expansion before translating it, which is difficult to get right and difficult to check. Instead, we should gracefully handle running out of space in the middle of translating an instruction by reverting to a known safepoint (the end of the last complete expansion) and ending the trace there.

Here's some pseudocode to capture the general idea:

buffer = [None] * MAX_TRACE_LENGTH
space_remaining = MAX_TRACE_LENGTH
index = 0

def add_uop(uop):
    global space_remaining, index
    if space_remaining <= 0:
        raise OutOfSpace
    buffer[index] = uop
    space_remaining -= 1
    index += 1

# Save room for the final _CHECK_VALIDITY_AND_SET_IP + _EXIT_TRACE:
space_remaining -= 2
add_uop(_START_EXECUTOR)
add_uop(_MAKE_WARM)
last_complete = index
try:
    for instruction in bytecode:
        if has_error(instruction):
            # Save room for an error stub:
            space_remaining -= 1
        if has_exit(instruction):
            # Save room for an exit stub:
            space_remaining -= 1
        add_uop(_CHECK_VALIDITY_AND_SET_IP)
        for uop in uop_expansion(instruction):
            add_uop(uop)
        # Successfully translated the entire instruction:
        last_complete = index
except OutOfSpace:
    # Drop the partially translated instruction:
    index = last_complete
# These always fit: their slots were reserved up front.
space_remaining = 2
add_uop(_CHECK_VALIDITY_AND_SET_IP)
add_uop(_EXIT_TRACE)
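Fleshing the pseudocode out into a runnable sketch (with a toy instruction encoding and a toy cap; all names are illustrative, not CPython's API), the revert-to-safepoint approach guarantees the finished trace never exceeds the cap, no matter how long the input is:

```python
MAX_TRACE_LENGTH = 10  # toy cap; the real UOP_MAX_TRACE_LENGTH is much larger

class OutOfSpace(Exception):
    pass

def translate(bytecode):
    """Translate toy instructions -- (uops, has_error, has_exit) tuples --
    reverting to the last completely translated instruction on overflow."""
    buffer = []
    # Reserve room up front for the final _CHECK_VALIDITY_AND_SET_IP
    # and _EXIT_TRACE:
    space_remaining = MAX_TRACE_LENGTH - 2

    def add_uop(uop):
        nonlocal space_remaining
        if space_remaining <= 0:
            raise OutOfSpace
        buffer.append(uop)
        space_remaining -= 1

    add_uop("_START_EXECUTOR")
    add_uop("_MAKE_WARM")
    last_complete = len(buffer)
    try:
        for uops, has_error, has_exit in bytecode:
            # Reserve room for this instruction's error/exit stubs:
            space_remaining -= has_error + has_exit
            add_uop("_CHECK_VALIDITY_AND_SET_IP")
            for uop in uops:
                add_uop(uop)
            # Successfully translated the entire instruction:
            last_complete = len(buffer)
    except OutOfSpace:
        del buffer[last_complete:]  # drop the partial instruction
    # These two always fit: their slots were reserved up front.
    buffer += ["_CHECK_VALIDITY_AND_SET_IP", "_EXIT_TRACE"]
    assert len(buffer) <= MAX_TRACE_LENGTH
    return buffer

# However long the input, the trace stays within the cap and ends cleanly:
trace = translate([(["_LOAD_FAST", "_LOAD_CONST", "_BINARY_OP"], True, True)] * 5)
assert len(trace) <= MAX_TRACE_LENGTH and trace[-1] == "_EXIT_TRACE"
```

The key invariant is that `last_complete` only advances at instruction boundaries, so reverting to it can never leave a half-translated instruction in the buffer.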
JeffersGlass commented 2 days ago

This makes sense to me!

I think the current crash is related to prepare_for_execution, where 'spare' positions at the end of the trace are used to turn implicit deopts into explicit ones, without verifying that there's actually enough spare space at the end of the buffer. For instance, the trace for foo.py has 590 uops in its buffer before preparation, but all the possible DEOPTs mean it has 839 uops after prepare_for_execution, so the assertion fails (or the buffer overflows).

// optimizer.c, L1297
printf("Trace length before: %d\n", length);
length = prepare_for_execution(buffer, length);
printf("Trace length after : %d\n", length);
Trace length before: 590
Trace length after : 839
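The arithmetic is easy to model: if every uop that can deopt gains one explicit side-exit uop during preparation, the prepared length is the base length plus the number of deopting uops, and nothing in that loop compares the sum against the cap. A toy illustration (hypothetical uop names; the 249 is inferred from the 590 -> 839 growth above):

```python
def prepared_length(trace, can_deopt):
    """Length after preparation: each uop that may deopt gets one
    explicit side-exit uop appended to the buffer."""
    return len(trace) + sum(uop in can_deopt for uop in trace)

# 590 uops, 249 of which can deopt, reproduces the growth seen above:
toy_trace = ["_GUARD_SOMETHING"] * 249 + ["_PLAIN_UOP"] * 341
assert prepared_length(toy_trace, {"_GUARD_SOMETHING"}) == 839
```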

So probably some bounds checking is necessary inside the loop... what would the desired behavior be if we run out of room at the end of the buffer during this step?

brandtbucher commented 2 days ago

I think it makes sense to check during trace projection, rather than after. During projection, we know whether a given instruction is only partly translated when we run out of space (which is the case we need to avoid), and even if we do run out of space, we can still successfully create a trace.

By the time we’re inserting exit/error stubs, in my opinion it’s too late. There’s a calculation that’s off during projection, and the fact that it doesn’t manifest until we insert stubs is a symptom, not a cause.