iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.47k stars 551 forks source link

Crash / ASan issues using VmModule.wrap_buffer from Python #17635

Open ScottTodd opened 2 weeks ago

ScottTodd commented 2 weeks ago

(Lots of red herrings here, see https://github.com/iree-org/iree/issues/17635#issuecomment-2163644331 for latest issue description)


Splitting off of https://github.com/llvm/torch-mlir/issues/3433 to discuss IREE-specific details.

I wrote some test cases for https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html starting from PyTorch and going through iree-turbine / torch-mlir to IREE: https://gist.github.com/ScottTodd/1e95795e79d17964078217ca98a3a398. Latest results:

test_single_value                  | PASS
test_multiple_values               | PASS (then crash)
test_broadcast_value_along_axis    | FAIL
test_broadcast_value_along_indices | FAIL
test_broadcast_values_along_axis   | PASS (then crash)

The two "pass then crash" test cases used to pass without crashing and I found some interesting results bisecting through IREE releases.

release test result
iree_compiler==20240514.893 PASS --> CRASH
iree_compiler==20240508.887
iree_compiler==20240507.886 Error invoking function ...
iree_compiler==20240506.885
iree_compiler==20240505.884
iree_compiler==20240504.883
iree_compiler==20240503.882
iree_compiler==20240502.881
iree_compiler==20240501.880 Error invoking function ...
iree_compiler==20240430.879
iree_compiler==20240429.878
iree_compiler==20240428.877
iree_compiler==20240427.876 Error invoking function ...
iree_compiler==20240426.875
iree_compiler==20240425.874 Error invoking function ...
iree_compiler==20240424.873 Error invoking function ...
iree_compiler==20240423.872 PASS --> CRASH
iree_compiler==20240422.871
iree_compiler==20240421.870
iree_compiler==20240420.869 PASS --> CRASH
iree_compiler==20240419.868 PASS --> CRASH
iree_compiler==20240418.867 PASS
iree_compiler==20240417.866 PASS
iree_compiler==20240416.865
iree_compiler==20240415.864
iree_compiler==20240414.863
iree_compiler==20240412.861
iree_compiler==20240411.860
iree_compiler==20240410.859 PASS

Crash (during shutdown, maybe from writing out of bounds?):

======================================== 1 passed in 3.60s ======================================== Exception Code: 0xC0000005
 #0 0x00007ffa6c89a495 PyInit__runtime (D:\dev\projects\SHARK-TestSuite\iree_tests\pytorch\operators\.venv\Lib\site-packages\iree\_runtime_libs\_runtime.cp311-win_amd64.pyd+0x4a495)
 ...
 #5 0x00007ffa6c91bf1c nanobind::python_error::what(void) const (D:\dev\projects\SHARK-TestSuite\iree_tests\pytorch\operators\.venv\Lib\site-packages\iree\_runtime_libs\_runtime.cp311-win_amd64.pyd+0xcbf1c)

"Error invoking function" (fixed by https://github.com/iree-org/iree/pull/17339):

    def _invoke(self, arg_list, ret_list):
>       self._vm_context.invoke(self._vm_function, arg_list, ret_list)
E       ValueError: Error invoking function: D:\a\iree\iree\c\runtime\src\iree\hal\command_buffer_validation.c:363: INVALID_ARGUMENT; source and target ranges overlap within the same buffer; while invoking native function hal.command_buffer.copy_buffer; while calling import;
E       [ 2]   native hal.command_buffer.copy_buffer:0 -
E       [ 1] bytecode module.main$async:526 d:\dev\projects\SHARK-TestSuite\iree_tests\pytorch\operators\index_put_test.py:88:0
E       [ 0] bytecode module.main:62 d:\dev\projects\SHARK-TestSuite\iree_tests\pytorch\operators\index_put_test.py:88:0

20240418.867 to 20240419.868, pass --> crash

Between 20240418.867 and 20240419.868, the test_multiple_values test case started crashing.

before: https://github.com/iree-org/iree/releases/tag/candidate-20240418.867 after: https://github.com/iree-org/iree/releases/tag/candidate-20240419.868

diff (https://github.com/iree-org/iree/compare/ff624dd8b27029f4d69f884a15ff3af8606b15b9...a2476ceef53f6d4ec0f7a9cf36892647626d324b):

λ git log --oneline candidate-20240418.867..candidate-20240419.868
a2476ceef5 [metal] Disable failing semaphore submission test until fixing (#17100)
0e1e6bfe7a Clarify fusion heuristic (#17098)
125f4200d8 [CodeGen] Add a pattern to fold extract_slice consumer into xfer.write. (#17067)
f755b42289 [Codegen] Add folding in createBoundedTileSize for partially dynamic wgSize. (#17089)
a0b4853dc7 Fixes to enable out-of-tree plugin builds. (#17095)
bd1b10626c [CodeGen] Drop encoding for HAL and Flow ops when DT is not supported. (#17081)
886c4169ac [Winograd] Use TilingInterface for all levels of winograd op tiling (#17061)
244195960f Add GPU dialect dependencies to C/Python bindings. (#17090)
ab4babee07 [ConstEval] Add flag to adjust tensor size limit for hoisting (#17064)
eec081c2c6 [LLVMGPU] Fallback if dynamic dim found on vector distribute. (#17085)
f86f21cd80 [python] Expose MLIR python bindings for gpu and transform (#17088)
4778f5f909 Categorize matmul/metvec like generic ops for dispatches (#17084)
f32a87ccd2 [Flow] Move elementwise op fusion and bubble up expand shapes patterns into their own pass. (#17068)
3677fbcce8 [runtime] Add semaphore test where 2 batches wait on a former batch amongst 2 (#17080)

20240423.872 to 20240424.873 crash --> error invoking function

before: https://github.com/iree-org/iree/releases/tag/candidate-20240423.872 after: https://github.com/iree-org/iree/releases/tag/candidate-20240424.873

diff (https://github.com/iree-org/iree/compare/f5660ee72bd8e71910d00cec726375fb4b4a6de8...59532d30c0b8b3a0cdf261eaa4e69784c9fddeee):

λ git log --oneline candidate-20240423.872..candidate-20240424.873
59532d30c0 Extend DecomposeConvolutionToLowerDimOpsPass (#17069)
1b4c76f931 Adding a hoistable attr interface to allow attaching attrs to hoists. (#17139)
729ebc642f [runtime][metal] exclude properly the failing semaphore test (#17151)
30acc53257 [runtime][cts] add test where a device batch signals another and the host (#17138)
94728e34c2 [LinalgExt][Winograd] Add winograd.filter_transform op (#17102)
4bc90e7431 Run pytorch/models tests on AMDGPU with Vulkan. (#17129)
330651efa6 [plugins][ROCM] Fix minor loc source resize bug (#17133)
f7098e3f66 Moving regression suite to azure (#17140)
290d812d96 [runtime][cts] add semaphore test where a batch waits on another and a host signal (#17130)
8736479f5d Prevent stream.async.update folding unless confirmed safe. (#17135)
7192e8c195 [Preprocessing] Add support for general ContractionOps and handle dynamic dimension using pad + expand. (#17123)
d7db353adf Add initial website documentation for ONNX frontend. (#17004)
568bb31780 [runtime][cts] Add test waiting on a semaphore for finite time and fix Vulkan driver (#17126)
655b71a282 Executable library call hooks system, and a sample Linux/CPU event implementation (#15803)
3dde925d11 [VectorDistribution] Add distribution pattern for vector::MultiDimReductionOp (#17076)
ScottTodd commented 2 weeks ago

Oooh, the test crash is flaky. That makes bisecting trickier... would be easier on Linux with ASan.

I tried around 10 times at iree_compiler==20240418.867 and can't repro there.

ScottTodd commented 2 weeks ago

Still verifying with ASan on Linux to avoid the flakes, but my bisect on Windows seems to be pointing to https://github.com/iree-org/iree/commit/f32a87ccd2 for the test crash. That's "mostly" an NFC according to @MaheshRavishankar 🤔

ScottTodd commented 2 weeks ago

Bah, having trouble pinning this down.

ScottTodd commented 2 weeks ago

:/ yeah, .vmfb files produced before and after https://github.com/iree-org/iree/commit/f32a87ccd2 are identical and that commit didn't modify runtime code or python bindings. That's what my bisect pointed to though.

Going to keep trying with ASan on Linux to see if I can get a 100% repro case instead of the flaky crashes on Windows.

ScottTodd commented 2 weeks ago

Following https://iree.dev/developers/debugging/sanitizers/#asan-addresssanitizer on Linux got me a source Python build with ASan on Linux that reports errors in the test case I've been working with:

(.venv) scotttodd@scotttodd-cpu:~/scratch/tests$ LD_PRELOAD=/usr/lib/llvm-14/lib/clang/14.0.0/lib/linux/libclang_rt.asan-x86_64.so pytest --log-cli-level=info index_put_test.py::TestIndexPut::test_multiple_values
================================================================================================== test session starts ===================================================================================================
platform linux -- Python 3.11.0rc1, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/scotttodd/scratch/tests
collected 1 item                                                                                                                                                                                                         

index_put_test.py::TestIndexPut::test_multiple_values 
----------------------------------------------------------------------------------------------------- live log call ------------------------------------------------------------------------------------------------------
INFO     index_put_test:index_put_test.py:110 module @module {
  func.func @main(%arg0: !torch.tensor<[3,6],f32>) -> !torch.vtensor<[3,6],f32> attributes {torch.assume_strict_symbolic_shapes} {
    %0 = torch.vtensor.literal(dense_resource<torch_tensor_3_torch.int64> : tensor<3xsi64>) : !torch.vtensor<[3],si64>
    %1 = torch.vtensor.literal(dense_resource<torch_tensor_3_torch.int64_1> : tensor<3xsi64>) : !torch.vtensor<[3],si64>
    %2 = torch.vtensor.literal(dense_resource<torch_tensor_3_torch.float32> : tensor<3xf32>) : !torch.vtensor<[3],f32>
    %3 = torch.copy.to_vtensor %arg0 : !torch.vtensor<[3,6],f32>
    %none = torch.constant.none
    %4 = torch.aten.clone %0, %none : !torch.vtensor<[3],si64>, !torch.none -> !torch.vtensor<[3],si64>
    %none_0 = torch.constant.none
    %5 = torch.aten.clone %1, %none_0 : !torch.vtensor<[3],si64>, !torch.none -> !torch.vtensor<[3],si64>
    %none_1 = torch.constant.none
    %6 = torch.aten.clone %2, %none_1 : !torch.vtensor<[3],f32>, !torch.none -> !torch.vtensor<[3],f32>
    %7 = torch.prim.ListConstruct %4, %5 : (!torch.vtensor<[3],si64>, !torch.vtensor<[3],si64>) -> !torch.list<optional<vtensor>>
    %false = torch.constant.bool false
    %8 = torch.aten.index_put %3, %7, %6, %false : !torch.vtensor<[3,6],f32>, !torch.list<optional<vtensor>>, !torch.vtensor<[3],f32>, !torch.bool -> !torch.vtensor<[3,6],f32>
    torch.overwrite.tensor.contents %8 overwrites %arg0 : !torch.vtensor<[3,6],f32>, !torch.tensor<[3,6],f32>
    return %8 : !torch.vtensor<[3,6],f32>
  }
}

{-#
  dialect_resources: {
    builtin: {
      torch_tensor_3_torch.int64: "0x08000000000000000000000001000000000000000200000000000000",
      torch_tensor_3_torch.int64_1: "0x08000000030000000000000004000000000000000500000000000000",
      torch_tensor_3_torch.float32: "0x04000000CDCCCC3DCDCC4C3E9A99993E"
    }
  }
#-}

INFO     index_put_test:index_put_test.py:128 [[0.  0.  0.  0.1 0.  0. ]
 [0.  0.  0.  0.  0.2 0. ]
 [0.  0.  0.  0.  0.  0.3]]
PASSED                                                                                                                                                                                                             [100%]

=================================================================================================== 1 passed in 5.95s ====================================================================================================
AddressSanitizer:DEADLYSIGNAL
=================================================================
==78453==ERROR: AddressSanitizer: SEGV on unknown address 0x7f04a3c4d074 (pc 0x7f04a12ad2d8 bp 0x000000000000 sp 0x7ffe78c08730 T0)
==78453==The signal is caused by a READ memory access.
    #0 0x7f04a12ad2d8  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0xae2d8) (BuildId: 32e87a22f20d0241)
    #1 0x7f04a1321d78  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0x122d78) (BuildId: 32e87a22f20d0241)
    #2 0x7f04a1321b86  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0x122b86) (BuildId: 32e87a22f20d0241)
    #3 0x7f04a125482d  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0x5582d) (BuildId: 32e87a22f20d0241)
    #4 0x5af471  (/usr/bin/python3.11+0x5af471) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #5 0x4d8278  (/usr/bin/python3.11+0x4d8278) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #6 0x62a28d  (/usr/bin/python3.11+0x62a28d) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #7 0x654986 in PyGC_Collect (/usr/bin/python3.11+0x654986) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #8 0x646d40 in Py_FinalizeEx (/usr/bin/python3.11+0x646d40) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #9 0x668cfb in Py_Exit (/usr/bin/python3.11+0x668cfb) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #10 0x6555de  (/usr/bin/python3.11+0x6555de) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #11 0x6555a5 in PyErr_PrintEx (/usr/bin/python3.11+0x6555a5) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #12 0x650a91 in _PyRun_SimpleFileObject (/usr/bin/python3.11+0x650a91) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #13 0x650832 in _PyRun_AnyFileObject (/usr/bin/python3.11+0x650832) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #14 0x64f786 in Py_RunMain (/usr/bin/python3.11+0x64f786) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #15 0x61ee0c in Py_BytesMain (/usr/bin/python3.11+0x61ee0c) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #16 0x7f04a4429d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
    #17 0x7f04a4429e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
    #18 0x61ec94 in _start (/usr/bin/python3.11+0x61ec94) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0xae2d8) (BuildId: 32e87a22f20d0241) 
==78453==ABORTING

That's at tip of tree. Going to sync back to the suspected commit ranges to see if ASan similarly complains there.

MaheshRavishankar commented 2 weeks ago

Thanks @ScottTodd . If it does get back to that commit then let me know.

ScottTodd commented 2 weeks ago

Having a hard time running my python test from an older commit.

ScottTodd commented 2 weeks ago

Argh, red herrings all around.

Running a trivial test case instead of index_put

class TorchModule(torch.nn.Module):
    def forward(self, input):
        return input + torch.ones(3, 4)

through Python in this setup with ASan on tip of tree also produces

==246621==ERROR: AddressSanitizer: SEGV on unknown address 0x7f96b787e050 (pc 0x7f96b43422d8 bp 0x000000000000 sp 0x7ffdeacb5240 T0)
==246621==The signal is caused by a READ memory access.

the problematic lines for the ASan error are

vm_module = ireert.load_vm_module(
    ireert.VmModule.wrap_buffer(
        config.vm_instance, compiled_module.map_memory()
    ),
    config,
)

Soooo... why does my TestIndexPut::test_single_value test case succeed on Windows while the TestIndexPut::test_multiple_values test case succeeds then crashes? Maybe the test cases write to different elements, near the edges of the buffers:

# Passing
[  #  col 0   col 1   col 2   col 3
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 0
    [0.0000, 0.0000, 0.5000, 0.0000],  # row 1
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 2
]

# Failing
[  #  col 0   col 1   col 2   col 3  col 4   col 5
    [0.0000, 0.0000, 0.0000, 0.1000, 0.0000, 0.0000],  # row 0
    [0.0000, 0.0000, 0.0000, 0.0000, 0.2000, 0.0000],  # row 1
    [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3000],  # row 2
]

Then also why is this behavior so difficult to pin down? Is ASan missing something? Is Python a relevant detail? The index put operations are trying to do in-place processing that could be hitting a special path in Python vs through iree-run-module. Could try building a .c test file using the C API.

ScottTodd commented 2 weeks ago

Maybe the test cases write to different elements, near the edges of the buffers:

Yep! This put of a single value also passes:

# indices=[torch.tensor([0]), torch.tensor([3])],
[  #  col 0   col 1   col 2   col 3
    [0.0000, 0.0000, 0.0000, 0.5000],  # row 0
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 1
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 2
]

but these (writing a single value into the last row) crash:

# indices=[torch.tensor([2]), torch.tensor([3])],
[  #  col 0   col 1   col 2   col 3
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 0
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 1
    [0.0000, 0.0000, 0.0000, 0.5000],  # row 2
]

# indices=[torch.tensor([2]), torch.tensor([0])],
[  #  col 0   col 1   col 2   col 3
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 0
    [0.0000, 0.0000, 0.0000, 0.0000],  # row 1
    [0.5000, 0.0000, 0.0000, 0.0000],  # row 2
]
ScottTodd commented 2 weeks ago

Writing into the last row of a [3, 4] tensor crashes. Writing into the last row of a [4, 4] or [6, 6] tensor does not. The full model from https://github.com/nod-ai/sharktank/issues/22 was using [?,16,32,100]. I'm wondering if the dynamic dim was odd?

ScottTodd commented 2 weeks ago

After restarting my machine, I'm only seeing the Python test crash in about 1/30 runs. That's going to make it hard to verify which test cases are definitely working and which aren't. Results were very consistent yesterday...

ScottTodd commented 2 weeks ago

Bah, just saw a crash writing into the middle of a 4x4 tensor. The crash I'm seeing might be unique to the Python bindings and maybe not even unique to index_put_. Logs: https://gist.github.com/ScottTodd/a0c0e68d1abeb3240f782045c4c70e80

ScottTodd commented 2 weeks ago

Aw, saw the same Python crash performing an elementwise add on a 4x4 tensor...

ScottTodd commented 2 weeks ago

Minimal repro for the Python AddressSanitizer report (note: this shows up when using iree-turbine with example code like https://github.com/iree-org/iree-turbine/blob/4b451f84b03f87af21a9b785b0ddd68094f43ed8/examples/aot_mlp/mlp_export_simple.py#L45-L49)

Seems to be related to VmModule.wrap_buffer() and map_memory() on a compiler output. Using VmModule.copy_buffer() avoids the ASan issue.

import iree.runtime as ireert

from iree.compiler.api import (
    Session,
    Source,
    Output,
)

session = Session()
session.set_flags("--iree-hal-target-backends=vmvx")
inv = session.invocation()
source = Source.wrap_buffer(
    session,
    b"""
builtin.module {
  func.func @abs(%input : tensor<4xf32>) -> (tensor<4xf32>) {
    %result = math.absf %input : tensor<4xf32>
    return %result : tensor<4xf32>
  }
}""",
)
inv.parse_source(source)
inv.execute()
out = Output.open_membuffer()
inv.output_vm_bytecode(out)

config = ireert.Config("local-sync")
# ASan issue goes away if this is commented out
vm_module = ireert.load_vm_module(
    ireert.VmModule.wrap_buffer(config.vm_instance, out.map_memory()),
    config,
)
(.venv) scotttodd@scotttodd-cpu:~/scratch/tests$ LD_PRELOAD=/usr/lib/llvm-14/lib/clang/14.0.0/lib/linux/libclang_rt.asan-x86_64.so ASAN_OPTIONS=detect_leaks=0 python asan_minimal_repro.py 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==39396==ERROR: AddressSanitizer: SEGV on unknown address 0x7fbf57968050 (pc 0x7fbffd1712d8 bp 0x000000000000 sp 0x7fffe15017f0 T0)
==39396==The signal is caused by a READ memory access.
    #0 0x7fbffd1712d8  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0xae2d8) (BuildId: 32e87a22f20d0241)
    #1 0x7fbffd1e5d78  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0x122d78) (BuildId: 32e87a22f20d0241)
    #2 0x7fbffd1e5b86  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0x122b86) (BuildId: 32e87a22f20d0241)
    #3 0x7fbffd11882d  (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0x5582d) (BuildId: 32e87a22f20d0241)
    #4 0x5af471  (/usr/bin/python3.11+0x5af471) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #5 0x4d8278  (/usr/bin/python3.11+0x4d8278) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #6 0x6557cf  (/usr/bin/python3.11+0x6557cf) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #7 0x654f99  (/usr/bin/python3.11+0x654f99) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #8 0x646d4d in Py_FinalizeEx (/usr/bin/python3.11+0x646d4d) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #9 0x64f5ea in Py_RunMain (/usr/bin/python3.11+0x64f5ea) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #10 0x61ee0c in Py_BytesMain (/usr/bin/python3.11+0x61ee0c) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #11 0x7fbfff629d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
    #12 0x7fbfff629e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
    #13 0x61ec94 in _start (/usr/bin/python3.11+0x61ec94) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/home/scotttodd/scratch/tests/.venv/lib/python3.11/site-packages/iree/_runtime_libs/_runtime.cpython-311-x86_64-linux-gnu.so+0xae2d8) (BuildId: 32e87a22f20d0241) 
==39396==ABORTING
benvanik commented 2 weeks ago

interesting - the compiler Source seems to hang on to the buffer provided in wrap_buffer, but you could try adding a print to its close() - the buffer as provided to ireeCompilerSourceWrapBuffer is unowned (it's just a char*) so if the python side doesn't keep it live it'll go boom

iow, that Source instance must remain live until the iree_vm_module_t created from it is deleted - you could add a printf in there to see if they happen out of order

stellaraccident commented 2 weeks ago

Try with copy_buffer instead of wrap_buffer just to eliminate variables. Removes any potential alignment or ownership issues.

stellaraccident commented 2 weeks ago

Ok, so that is a definite bug with wrap_buffer used in that way. Will need to fortify testing of that and fix.

But probably not what you are trying to find...

ScottTodd commented 2 weeks ago

Try with copy_buffer instead of wrap_buffer just to eliminate variables. Removes any potential alignment or ownership issues.

My tests pass a clean ASan report when I use copy_buffer instead of wrap_buffer. That might be enough to call the lowering/codegen/runtime for index_put_ correct (except for the unhandled broadcast case in the torch-mlir lowering, tracked here: https://github.com/llvm/torch-mlir/issues/3433).

you could add a printf in there to see if they happen out of order

Trying this now. Printfs seem to be working from compiler Python code but I'm not seeing my changes to the runtime reflected in my venv... strange (my PYTHONPATH and build setup both seem fine...).

ScottTodd commented 2 weeks ago

Sorta figured out my python bindings debug setup:

I found that if I change BoundModule in system_api.py to no longer retain the SystemContext: https://github.com/iree-org/iree/blob/0a561c47c3dc77ce038aef0ca9bf8d8f02ff2f2a/runtime/bindings/python/iree/runtime/system_api.py#L123-L130

then the ASan error goes away. Something is trying to access memory that it shouldn't be, just not sure what specifically.

Printf debugging:

test.py calling Source.wrap_buffer(session, ...)
--- CompilerDriver.cpp: ireeCompilerSourceWrapBuffer, length: 167, isNullTerminated: 0 ---
test.py calling inv.parse_source(source)
test.py calling inv.execute()
test.py calling out = Output.open_membuffer()
--- CompilerDriver.cpp: ireeCompilerOutputOpenMembuffer ---
test.py calling inv.output_vm_bytecode(out)
--- CompilerDriver.cpp: ireeCompilerInvocationOutputVMBytecode ---
test.py calling ireert.VmModule.wrap_buffer
--- CompilerDriver.cpp: ireeCompilerOutputMapMemory ---
--- vm.cc: VmModule::WrapBuffer, (edit) close_buffer: 0 ---
test.py calling ireert.load_vm_module
system_api.py load_vm_module --> load_vm_modules
system_api.py load_vm_modules
system_api.py: SystemContext::__init__
system_api.py: SystemContext::__init__ is *not* dynamic
system_api.py: SystemContext::__init__ setup self._bound_modules
test.py finished
--- CompilerDriver.cpp: ireeCompilerSourceDestroy start ---
--- CompilerDriver.cpp: ireeCompilerSourceDestroy finish ---
AddressSanitizer:DEADLYSIGNAL
=================================================================
==189798==ERROR: AddressSanitizer: SEGV on unknown address
stellaraccident commented 2 weeks ago

Let's fork this repro into an issue for wrap_buffer from the compiler API. I've seen evidence of something like this before but have had trouble getting it to repro.

ScottTodd commented 2 weeks ago

Sure. I'm not sure if there's actually an issue with index_put_ at all... possibly all red herrings in my testing that were caused by wrap_buffer. Can still start a more focused issue.

stellaraccident commented 2 weeks ago

Frustrating.

ScottTodd commented 2 weeks ago

This test looks quite similar to what I'm testing: https://github.com/iree-org/iree/blob/main/compiler/bindings/python/test/api/output_buffer_reference_test.py, and that is ASan-clean.

I'll see where the code diverges and try to file a more focused issue once I know more.

stellaraccident commented 2 weeks ago

If it turns out this is the issue, we can just rename this issue and not have another one. Was this the only thing wrong the whole time?

stellaraccident commented 2 weeks ago

This test looks quite similar to what I'm testing: https://github.com/iree-org/iree/blob/main/compiler/bindings/python/test/api/output_buffer_reference_test.py, and that is ASan-clean.

I'll see where the code diverges and try to file a more focused issue once I know more.

I mean, I've "fixed" this bug a couple of times. Something subtle is wrong and I think it needs a closer look at actual reference counts or something.

ScottTodd commented 2 weeks ago

Was this the only thing wrong the whole time?

Unclear. We originally hit crashes in decode() for the sharktank llama model in Python and native tools that we were trying to reduce. Some forms of reducing the test hit unsupported lowerings (broadcasting behavior). I think there's still a different crash involved somewhere since this latest line of debugging involves only the Python bindings.

ScottTodd commented 2 weeks ago
  • Building with CMake and setting PYTHONPATH was not enough to get updated C++ code in the runtime, at least when I already had iree-runtime installed from pip. Maybe uninstalling from pip would help. What I ended up doing was python -m pip wheel runtime/ -v to build a wheel then pip install iree_runtime-0.dev0+[...].whl

Aha! Uninstalling the python runtime wheel and just setting PYTHONPATH to the build worked, and now I get stack traces from ASan!

scotttodd@scotttodd-cpu:~/scratch/tests$ LD_PRELOAD=/usr/lib/llvm-14/lib/clang/14.0.0/lib/linux/libclang_rt.asan-x86_64.so ASAN_OPTIONS=detect_leaks=0 ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-14/bin/llvm-symbolizer python asan_repro_debugging.py 
test.py calling Source.wrap_buffer(session, ...)
CompilerDriver.cpp: ireeCompilerSourceWrapBuffer, length: 167, isNullTerminated: 0 ---
test.py calling inv.parse_source(source)
test.py calling inv.execute()
test.py calling out = Output.open_membuffer()
CompilerDriver.cpp: ireeCompilerOutputOpenMembuffer 
test.py calling inv.output_vm_bytecode(out)
CompilerDriver.cpp: ireeCompilerInvocationOutputVMBytecode 
test.py calling out.map_memory()
ctypes_dl.py :: Output::map_memory
CompilerDriver.cpp: ireeCompilerOutputMapMemory 
test.py calling ireert.VmModule.wrap_buffer()
vm.cc: VmModule::WrapBuffer, close_buffer: 0 
vm.cc: VmModule::WrapBuffer, iree_vm_bytecode_module_create
VMFB Length = 5558
test.py calling ireert.load_vm_module
system_api.py load_vm_module --> load_vm_modules
system_api.py load_vm_modules
system_api.py: SystemContext::__init__
system_api.py: SystemContext::__init__ is *not* dynamic
system_api.py: SystemContext::__init__ setup self._bound_modules
test.py finished
CompilerDriver.cpp: ireeCompilerSourceDestroy start 
CompilerDriver.cpp: ireeCompilerSourceDestroy finish
AddressSanitizer:DEADLYSIGNAL
=================================================================
==229852==ERROR: AddressSanitizer: SEGV on unknown address 0x7f66510ff050 (pc 0x7f66efa5f25e bp 0x7fff9db6e9d0 sp 0x7fff9db6e950 T0)
==229852==The signal is caused by a READ memory access.
+   #0 0x7f66efa5f25e in __flatbuffers_soffset_read /home/scotttodd/iree/third_party/flatcc/include/flatcc/flatcc_endian.h:89:2
+   #1 0x7f66efa5f25e in __flatbuffers_soffset_read_from_pe /home/scotttodd/iree/third_party/flatcc/include/flatcc/flatcc_endian.h:89:2
+   #2 0x7f66efa5f25e in iree_vm_BytecodeModuleDef_exported_functions /home/scotttodd/iree-build/runtime/src/iree/schemas/bytecode_module_def_reader.h:693:1
+   #3 0x7f66efa5f25e in iree_vm_bytecode_module_lookup_function /home/scotttodd/iree/runtime/src/iree/vm/bytecode/module.c:292:9
+   #4 0x7f66efb5b497 in iree_vm_context_run_function /home/scotttodd/iree/runtime/src/iree/vm/context.c:77:26
+   #5 0x7f66efb5b497 in iree_vm_context_release_modules /home/scotttodd/iree/runtime/src/iree/vm/context.c:269:5
+   #6 0x7f66efb5acba in iree_vm_context_destroy /home/scotttodd/iree/runtime/src/iree/vm/context.c:357:5
+   #7 0x7f66ef9c0cbe in iree::python::ApiPtrAdapter<iree_vm_context_t>::Release(iree_vm_context_t*) /home/scotttodd/iree/runtime/bindings/python/./vm.h:42:47
+   #8 0x7f66ef9c0cbe in iree::python::ApiRefCounted<iree::python::VmContext, iree_vm_context_t>::Release() /home/scotttodd/iree/runtime/bindings/python/./binding.h:107:7
+   #9 0x7f66ef9c0cbe in iree::python::ApiRefCounted<iree::python::VmContext, iree_vm_context_t>::~ApiRefCounted() /home/scotttodd/iree/runtime/bindings/python/./binding.h:59:22
+   #10 0x7f66ef9e40f5 in nanobind::detail::inst_dealloc(_object*) /home/scotttodd/iree/.venv/lib/python3.11/site-packages/nanobind/src/nb_type.cpp:229:13
    #11 0x5af471  (/usr/bin/python3.11+0x5af471) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #12 0x4d8278  (/usr/bin/python3.11+0x4d8278) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #13 0x6557cf  (/usr/bin/python3.11+0x6557cf) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #14 0x654f99  (/usr/bin/python3.11+0x654f99) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #15 0x646d4d in Py_FinalizeEx (/usr/bin/python3.11+0x646d4d) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #16 0x64f5ea in Py_RunMain (/usr/bin/python3.11+0x64f5ea) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #17 0x61ee0c in Py_BytesMain (/usr/bin/python3.11+0x61ee0c) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
    #18 0x7f66f2229d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
    #19 0x7f66f2229e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
    #20 0x61ec94 in _start (/usr/bin/python3.11+0x61ec94) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/scotttodd/iree/third_party/flatcc/include/flatcc/flatcc_endian.h:89:2 in __flatbuffers_soffset_read
==229852==ABORTING
ScottTodd commented 2 weeks ago

So the VM is trying to run the __deinit function on the module, but the module was already freed...?

https://github.com/iree-org/iree/blob/71c07faba52f527b08abc73c652ba140f1d8aa54/runtime/src/iree/vm/context.c#L269-L270

ScottTodd commented 2 weeks ago

Yeah, ASan is happy if I clear the loaded module before letting the program exit on its own:

loaded_module = ireert.load_vm_module(
    wrapped_buffer,
    config,
)
+loaded_module = None
ScottTodd commented 2 weeks ago

This passes ASan, finishing with no leaks or segfaults:

    instance = VmInstance()
    output = Output.open_membuffer()
    output.write(vmfb_contents)
    mapped_memory = output.map_memory()
    module = VmModule.wrap_buffer(instance, mapped_memory)
    context = VmContext(instance, modules=[module])

This crashes and trips ASan with a segfault on a read memory access:

    instance = VmInstance()
    output = Output.open_membuffer()
    output.write(vmfb_contents)
    mapped_memory = output.map_memory()
    module = VmModule.wrap_buffer(instance, mapped_memory)
    # note this line is different!
    loaded_module = load_vm_module(module)

The source for that different line is here: https://github.com/iree-org/iree/blob/71c07faba52f527b08abc73c652ba140f1d8aa54/runtime/bindings/python/iree/runtime/system_api.py#L247-L251

the SystemContext class creates BoundModule classes that retain a reference to the self SystemContext: https://github.com/iree-org/iree/blob/71c07faba52f527b08abc73c652ba140f1d8aa54/runtime/bindings/python/iree/runtime/system_api.py#L204-L206

Is that reference cycle throwing off the usual garbage collector / shutdown / destruction ordering?

benvanik commented 2 weeks ago

garbage collection, not-once-matty-matheson 1

(says an old school vb/c#/js dev)

ScottTodd commented 2 weeks ago

Adding a weakref here makes ASan happy for tests that just load a module and then exit. That's not quite the right fix though, since tests that actually use the module seem to then be missing the object (already gc'd?) :P

            self._bound_modules = BoundModules(
-               [(m.name, BoundModule(self, m)) for m in init_vm_modules]
+               [(m.name, BoundModule(weakref.ref(self), m)) for m in init_vm_modules]
            )
stellaraccident commented 2 weeks ago

We need to root cause where the ref isn't being retained... It shouldn't be possible to crash with pure python usage of the APIs.

This might actually be behind some other ghosts I've been carefully trying to catch for a while. There's a bit of art to tracking this to a root cause. I can't do it right now but could help.

You started down this path because of a specific failure scenario. Is this the root cause of that or just an incidental thing found along the way (just trying to understand whether we have more going on)?

ScottTodd commented 2 weeks ago

You started down this path because of a specific failure scenario. Is this the root cause of that or just an incidental thing found along the way (just trying to understand whether we have more going on)?

I believe it's incidental along the way, but I'd love to be wrong there.

The sequence was roughly:

  1. We observed that the llama model from sharktank was crashing during the decode() step: https://github.com/nod-ai/sharktank/issues/22#issuecomment-2151089923, https://github.com/rsuderman/sharktank/blob/d9b25900a069f8200948da322de693e5d6bae15f/shortfin/shortfin/llm/impl/service_v1_cli.py#L101-L106 . We suspected out of bounds write in the kv_cache relying on the in-place index_put_ (source here), but there were too many unknowns to say for sure.

    ~/Repos/iree/build/tools/iree-run-module --module=/tmp/batch_llama_v1.vmfb  --function=decode_bs4 --parameters=model=/media/rsuderman/Disk2/Models/llama.slycho.gguf --input=4x1xsi64=0 --input=4xsi64=1 --input=4xsi64=1 --input=4x1xsi64=0,1,2,3 --input=8x2662400xf16=0.0
    
    EXEC @decode_bs4
    Exception thrown at 0x000001AA8B4EBFF8 in iree-run-module.exe: 0xC0000005: Access violation writing location 0xFFFF9B96ECCF4300.
    
    iree-run-module.exe!iree_elf_call_i_ppp() Line 184 (d:\dev\projects\iree\runtime\src\iree\hal\local\elf\arch\x86_64_msvc.asm:184)
  2. We attempted to reduce the full model down to a smaller model and some unit tests for index_put_. Some of those failed to compile, due to https://github.com/llvm/torch-mlir/issues/3433 (unimplemented supported for broadcasting variants of the PyTorch index_put_ op in torch-mlir).

  3. Other unit tests produced correct outputs but crashed at runtime. The crash looked to be consistent enough to bisect across releases and I thought I had a culprit range of commits, but that was not the case. That led to this issue, which eventually narrowed in on the Python tests themselves hitting this buggy wrap_buffer shutdown / garbage collect behavior, independent of what the actual code in the test modules was doing.

So if this gets fixed, writing tests and exercising the code from Python will be more stable (especially with ASan enabled), but there is likely more debugging ahead - either going back to the full program or trying to build up component by component (e.g. verifying the behavior of just PagedKVCache.write_timestep and the other functions involved there).

ScottTodd commented 2 weeks ago

Still debugging this, reading through the changes in https://github.com/iree-org/iree/pull/15975

The iree_compiler_output_t Output from compiler/bindings/python/iree/compiler/api/ctypes_dl.py is hitting its weakref.finalize(pointer, lambda x: ..., self) call before the deallcator from VmModule VmModule::WrapBuffer runs.

https://github.com/iree-org/iree/blob/b4321ea20b0e45c4179f800a1b65df44bef9ee48/compiler/bindings/python/iree/compiler/api/ctypes_dl.py#L273-L290 https://github.com/iree-org/iree/blob/b4321ea20b0e45c4179f800a1b65df44bef9ee48/runtime/bindings/python/vm.cc#L289-L387

ScottTodd commented 2 weeks ago

https://docs.python.org/3/library/gc.html#gc.set_debug this looks useful...

Output with gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_COLLECTABLE | gc.DEBUG_UNCOLLECTABLE) and a ton of printfs: https://gist.github.com/ScottTodd/453834cb5731598f6f8a15b27700d518

ScottTodd commented 2 weeks ago

I'm skeptical of this pattern:

pointer = (c_char * size).from_address(contents.value) 
weakref.finalize(pointer, lambda x: ..., self)
return pointer

I've made a number of attempts to add extra references, log what the garbage collector is doing, log which C++ constructors / destructors are running, etc. here: https://github.com/iree-org/iree/compare/main...ScottTodd:wrap-buffer-debugging?expand=1. Not sure what else to try. Reading various stackoverflow questions about python memoryview, from_address, garbage collection, and related systems now...

ScottTodd commented 2 weeks ago

Found another project using the same code pattern 🤔 https://github.com/GridTools/stencil_benchmarks/blob/33658cd68f3e2248611f6a845a39ae8b9684af78/stencil_benchmarks/benchmarks_collection/stencils/cuda_hip/api.py#L54-L69

ScottTodd commented 2 weeks ago

Also reading https://nanobind.readthedocs.io/en/latest/ownership.html now

I'm not seeing the destructors for BufferState or PyBufferRequest run before the weakref.finalize on Output.map_memory() runs, so they aren't holding onto a reference to the pointer / memoryview? https://github.com/iree-org/iree/blob/045bf3266bb2cd827557c31d2229e0739de7cb4a/runtime/bindings/python/vm.cc#L299 https://github.com/iree-org/iree/blob/045bf3266bb2cd827557c31d2229e0739de7cb4a/runtime/bindings/python/buffer_interop.h#L22

ScottTodd commented 2 weeks ago

I guess we're also interoping between nanobind and pybind here? The compiler uses pybind but the runtime uses nanobind.

stellaraccident commented 2 weeks ago

Yes, lots of interop. It might have been more effective to just randomly rotate the pointer and hope that sometimes it worked out.

ScottTodd commented 2 weeks ago

Tried some variations of py::keep_alive here: https://github.com/iree-org/iree/blob/045bf3266bb2cd827557c31d2229e0739de7cb4a/runtime/bindings/python/vm.cc#L945-L947

"Keep the buffer argument alive while the returned VmModule is alive"

Might have missed some syntax but that didn't seem to help.

ScottTodd commented 2 weeks ago

Using gc.get_referrers I can see that the weakref.finalize code in map_memory() is creating a reference to the Output object. I see no other references to that object or any references to the returned pointer though.

ScottTodd commented 2 weeks ago

Is there a way we can use from_buffer here instead of from_address? https://github.com/iree-org/iree/blob/045bf3266bb2cd827557c31d2229e0739de7cb4a/compiler/bindings/python/iree/compiler/api/ctypes_dl.py#L273-L290

I'm wondering if we can get this instance variable to be populated:

_objects This member is either None or a dictionary containing Python objects that need to be kept alive so that the memory block contents is kept valid. This object is only exposed for debugging; never modify the contents of this dictionary.

The code before https://github.com/iree-org/iree/pull/15975 used return memoryview((c_char * size).from_address(contents.value)) too.

stellaraccident commented 2 weeks ago

Whatever can be made to work yes. The big hammer is to add a native pybind module for interop

ScottTodd commented 2 weeks ago

Going to put this on hold for a bit.