modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[BUG]: C++ exceptions from Python causing segfault in Mojo #986

Open jackos opened 1 year ago

jackos commented 1 year ago

Bug description

If a C++ exception is thrown that Python would normally catch and report correctly, it instead results in a segfault when the call is made from Mojo. For example, the shapes are incompatible here:

```
from python import Python

fn main() raises:
    let torch = Python.import_module("torch")
    let a = torch.randn(4, 2)
    let b = torch.randn(2, 2)
    let c = a * b
    print(c)
```

Results in a segfault:

```
 #0 0x0000000104ae1f80 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/Users/jack/src/modular/.derived/build-release/bin/mojo+0x1000c5f80)
 #1 0x0000000104ae00e0 llvm::sys::RunSignalHandlers() (/Users/jack/src/modular/.derived/build-release/bin/mojo+0x1000c40e0)
 #2 0x0000000104ae261c SignalHandler(int) (/Users/jack/src/modular/.derived/build-release/bin/mojo+0x1000c661c)
 #3 0x00000001814d6a24 (/usr/lib/system/libsystem_platform.dylib+0x18042ea24)
 #4 0x000000028003e780 
 #5 0x000000028003e780 
 #6 0x0000000104e32730 M::KGEN::ExecutionEngine::runProgram(llvm::StringRef, llvm::StringRef, llvm::function_ref<M::ErrorOrSuccess (void*)>) (/Users/jack/src/modular/.derived/build-release/bin/mojo+0x100416730)
 #7 0x0000000104a3c254 run(M::State const&) (/Users/jack/src/modular/.derived/build-release/bin/mojo+0x100020254)
 #8 0x0000000104a2511c main (/Users/jack/src/modular/.derived/build-release/bin/mojo+0x10000911c)
 #9 0x000000018114ff28 
[1]    27565 segmentation fault  mojo main.mojo
```

Whereas doing the same thing in Python returns the correct error:

```
import torch

a = torch.randn(4, 2)
b = torch.randn(2, 2)
c = a * b
print(c)
```

```
Traceback (most recent call last):
  File "/Users/jack/src/pytorch-test/main.py", line 5, in <module>
    c = a * b
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0
```

This may affect any project that uses C++ exceptions, or it may just be that the way PyTorch handles them is incompatible.

Discovered from: https://github.com/modularml/mojo/issues/974

System information

- What OS did you install Mojo on? Reproducible on macOS and Linux
- Provide version information for Mojo by pasting the output of `mojo -v`: mojo 0.4.0 (3ad45a54)
- Provide Modular CLI version by pasting the output of `modular -v`: modular 0.4.0 (3ad45a54)

Mogball commented 1 year ago

Thanks for filing! The approach here is tricky, because we don't build our C++ code with exceptions. We'll keep this on the radar.

ihnorton commented 1 year ago

> we don't build our C++ code with exceptions

FWIW, in general the exception shouldn't escape the pytorch/pybind11 module boundary. pytorch uses pybind11 exception translation, which should catch C++ exceptions and convert them to CPython exceptions.

This little demo seems to do what I expect: in a mojo try block, exceptions from a pybind11-wrapped function are caught; without a mojo try block the exception causes program exit.
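
For illustration, here is a minimal sketch of the handling I would expect to work for the OP's reproduction if pybind11 translation kicks in at the boundary (this is not the linked demo; it assumes Mojo 0.4-era syntax with `let`, and with torch it currently hits the segfault described above rather than the `except` branch):

```
from python import Python

fn main() raises:
    let torch = Python.import_module("torch")
    let a = torch.randn(4, 2)
    let b = torch.randn(2, 2)
    try:
        # Incompatible shapes: the pybind11-translated RuntimeError should
        # surface as a Mojo error here...
        let c = a * b
        print(c)
    except e:
        # ...and be caught and printed rather than crashing the process.
        print(e)
```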

It looks like the issue here may be that the backtrace handler (?) is trying to call `__str__` on a PyObject which no longer exists (the output below is from running the OP's code under lldb on GitHub Codespaces with Mojo 0.3.1):

```
@ihnorton ➜ /workspaces/mojo-playground/test_throw (main) $ mojo build orig.mojo
@ihnorton ➜ /workspaces/mojo-playground/test_throw (main) $ mojo lldb -- orig
Current executable set to '/workspaces/mojo-playground/test_throw/orig' (x86_64).
(lldb) b __cxa_throw
Breakpoint 1: no locations (pending).
WARNING: Unable to resolve breakpoint to any actual locations.
(lldb) r
Process 31128 launched: '/workspaces/mojo-playground/test_throw/orig' (x86_64)
1 location added to breakpoint 1
Process 31128 stopped and restarted: thread 1 received signal: SIGCHLD
Process 31128 stopped
* thread #1, name = 'orig', stop reason = breakpoint 1.1
    frame #0: 0x00007ffff7e2c650 libstdc++.so.6`__cxa_throw
libstdc++.so.6`__cxa_throw:
->  0x7ffff7e2c650 <+0>: endbr64
    0x7ffff7e2c654 <+4>: pushq %r13
    0x7ffff7e2c656 <+6>: movq %rdx, %r13
    0x7ffff7e2c659 <+9>: pushq %r12
(lldb) bt
* thread #1, name = 'orig', stop reason = breakpoint 1.1
  * frame #0: 0x00007ffff7e2c650 libstdc++.so.6`__cxa_throw
    frame #1: 0x00007fff827bd39b libc10.so`c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 148
    frame #2: 0x00007fffaa1f1214 libtorch_cpu.so`at::infer_size_dimvector(c10::ArrayRef, c10::ArrayRef) + 932
    frame #3: 0x00007fffaa284605 libtorch_cpu.so`at::TensorIteratorBase::compute_shape(at::TensorIteratorConfig const&) + 261
    frame #4: 0x00007fffaa285929 libtorch_cpu.so`at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 89
    frame #5: 0x00007fffaa286f22 libtorch_cpu.so`at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 178
    frame #6: 0x00007fffab45107a libtorch_cpu.so`at::(anonymous namespace)::wrapper_CPU_mul_Tensor(at::Tensor const&, at::Tensor const&) + 74
    frame #7: 0x00007fffab4510e0 libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_, at::Tensor, c10::guts::typelist::typelist>, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) + 16
    frame #8: 0x00007fffaad7a27e libtorch_cpu.so`at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) + 110
    frame #9: 0x00007fffac8c958d libtorch_cpu.so`torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) + 301
    frame #10: 0x00007fffac8ca013 libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_, at::Tensor, c10::guts::typelist::typelist>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) + 19
    frame #11: 0x00007fffaadd78b1 libtorch_cpu.so`at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&) + 353
    frame #12: 0x00007fffc1d145a6 libtorch_python.so`torch::autograd::THPVariable_mul(_object*, _object*, _object*) + 614
    frame #13: 0x00007fffc1d146f7 libtorch_python.so`_object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul(_object*, _object*, _object*)>(_object*, _object*, _object*) + 7
    frame #14: 0x00007ffff77f9dab libpython3.10.so.1.0`cfunction_call(func=0x00007ffff7034130, args=, kwargs=) at methodobject.c:543
(lldb) c
Process 31128 resuming
Process 31128 stopped
* thread #1, name = 'orig', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x8)
    frame #0: 0x00007ffff78150ec libpython3.10.so.1.0`PyObject_GetAttrString(v=0x0000000000000000, name="__str__") at object.c:810
(lldb) bt
* thread #1, name = 'orig', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x8)
  * frame #0: 0x00007ffff78150ec libpython3.10.so.1.0`PyObject_GetAttrString(v=0x0000000000000000, name="__str__") at object.c:810
    frame #1: 0x00005555555591d4 orig`$python::$cpython::CPython::PyObject_GetAttrString($python::$cpython::CPython&,$python::$cpython::PyObjectPtr,$builtin::$stringref::StringRef) + 52
    frame #2: 0x000055555555afa2 orig`main + 2850
    frame #3: 0x00007ffff7a48083 libc.so.6`__libc_start_main(main=(orig`main), argc=1, argv=0x00007fffffffd198, init=, fini=, rtld_fini=, stack_end=0x00007fffffffd188) at libc-start.c:308:16
    frame #4: 0x000055555555874e orig`_start + 46
(lldb)
```

(without capability to printstrument the interop and repl layers I'll stop there for now 😅)
jackos commented 5 months ago

Update: this no longer segfaults; instead of raising an error, it now returns `<NULL>`.
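
For concreteness, a sketch of what I understand the current behavior to be (assuming present-day Mojo syntax with `var`; the `<NULL>` output is taken from the observation above, not re-verified here):

```
from python import Python

fn main() raises:
    var torch = Python.import_module("torch")
    var a = torch.randn(4, 2)
    var b = torch.randn(2, 2)
    try:
        var c = a * b
        # Observed: `c` prints as <NULL> instead of holding a tensor.
        print(c)
    except e:
        # Expected: the shape-mismatch error would be raised and caught here.
        print(e)
```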