iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Segfault in iree_elf_call_i_ppp #11636

Closed by sogartar 1 year ago

sogartar commented 1 year ago

What happened?

I ran into a segmentation fault when trying to run mnist_train.mlir.tar.gz.

The stack trace at the point of the crash is:

#0  iree_elf_call_i_ppp (symbol_ptr=0x7ffff72636e0, a0=0x379ba8, a1=0x7fffffff8c50, a2=0x7fffffff8040) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/local/elf/arch/x86_64.c:206
#1  0x00000000002edb59 in iree_hal_elf_executable_issue_call (base_executable=0x379b70, ordinal=134, dispatch_state=0x7fffffff8c50, workgroup_state=0x7fffffff8040, worker_id=0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/local/loaders/embedded_elf_loader.c:305
#2  0x00000000002f6c1b in iree_hal_local_executable_issue_call (executable=0x379b70, ordinal=134, dispatch_state=0x7fffffff8c50, workgroup_state=0x7fffffff8040, worker_id=0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/local/local_executable.c:58
#3  0x00000000002f6d72 in iree_hal_local_executable_issue_dispatch_inline (executable=0x379b70, ordinal=134, dispatch_state=0x7fffffff8c50, processor_id=8, local_memory=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/local/local_executable.c:99
#4  0x00000000002ec0ed in iree_hal_inline_command_buffer_dispatch (base_command_buffer=0x7fffffff82d0, executable=0x379b70, entry_point=134, workgroup_x=1, workgroup_y=1, workgroup_z=1) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/local/inline_command_buffer.c:545
#5  0x00000000002730c6 in iree_hal_command_buffer_dispatch (command_buffer=0x7fffffff82d0, executable=0x379b70, entry_point=134, workgroup_x=1, workgroup_y=1, workgroup_z=1) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/command_buffer.c:446
#6  0x000000000028e9b7 in iree_hal_deferred_command_buffer_apply_dispatch (target_command_buffer=0x7fffffff82d0, binding_table=..., cmd=0xab8c40) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/utils/deferred_command_buffer.c:781
#7  0x000000000028e4b9 in iree_hal_deferred_command_buffer_apply (base_command_buffer=0x379a40, target_command_buffer=0x7fffffff82d0, binding_table=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/utils/deferred_command_buffer.c:928
#8  0x000000000028cac0 in iree_hal_sync_device_apply_deferred_command_buffers (device=0x378580, command_buffer_count=1, command_buffers=0x7fffffff8ec0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/drivers/local_sync/sync_device.c:328
#9  0x000000000028c6a0 in iree_hal_sync_device_queue_execute (base_device=0x378580, queue_affinity=18446744073709551615, wait_semaphore_list=..., signal_semaphore_list=..., command_buffer_count=1, command_buffers=0x7fffffff8ec0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/drivers/local_sync/sync_device.c:359
#10 0x0000000000275d29 in iree_hal_device_queue_execute (device=0x378580, queue_affinity=18446744073709551615, wait_semaphore_list=..., signal_semaphore_list=..., command_buffer_count=1, command_buffers=0x7fffffff8ec0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/hal/device.c:237
#11 0x0000000000306037 in iree_hal_module_device_queue_execute (stack=0x7fffffffb290, module=0x378630, state=0x378810, args=0x7fffffff9290, rets=0x7fffffff9290) at /home/petkantchin/ws/iree/repo/runtime/src/iree/modules/hal/module.c:992
#12 0x0000000000354eaf in iree_vm_shim_rIrrCrD_v (stack=0x7fffffffb290, flags=1, args_storage=..., rets_storage=..., target_fn=0x305de0 <iree_hal_module_device_queue_execute>, module=0x378630, module_state=0x378810) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/shims.c:68
#13 0x000000000034f5dd in iree_vm_native_module_issue_call (module=0x378630, stack=0x7fffffffb290, callee_frame=0x7fffffffb890, flags=1, args_storage=..., rets_storage=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/native_module.c:324
#14 0x000000000034f198 in iree_vm_native_module_begin_call (self=0x378630, stack=0x7fffffffb290, call=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/native_module.c:378
#15 0x0000000000323dad in iree_vm_bytecode_issue_import_call (stack=0x7fffffffb290, call=..., cconv_results=..., dst_reg_list=0x7ffff72cd9a2, out_caller_frame=0x7fffffffaeb0, out_caller_registers=0x7fffffffaef0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/bytecode_dispatch.c:488
#16 0x0000000000322b3d in iree_vm_bytecode_call_import_variadic (stack=0x7fffffffb290, module_state=0x378970, import_ordinal=2147483668, caller_registers=..., segment_size_list=0x7ffff72cd98a, src_reg_list=0x7ffff72cd996, dst_reg_list=0x7ffff72cd9a2, out_caller_frame=0x7fffffffaeb0, out_caller_registers=0x7fffffffaef0) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/bytecode_dispatch.c:647
#17 0x000000000031c862 in iree_vm_bytecode_dispatch (stack=0x7fffffffb290, module=0x378100, current_frame=0x7fffffffb300, regs=..., call_results=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/bytecode_dispatch.c:1722
#18 0x0000000000311f16 in iree_vm_bytecode_dispatch_begin (stack=0x7fffffffb290, module=0x378100, call=..., cconv_arguments=..., cconv_results=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/bytecode_dispatch.c:674
#19 0x000000000030c11a in iree_vm_bytecode_module_begin_call (self=0x378100, stack=0x7fffffffb290, call=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/bytecode_module.c:1110
#20 0x00000000003481f3 in iree_vm_begin_invoke (state=0x7fffffffb258, context=0x378790, function=..., flags=0, policy=0x0, inputs=0x3797d0, host_allocator=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/invocation.c:456
#21 0x0000000000347a3a in iree_vm_invoke (context=0x378790, function=..., flags=0, policy=0x0, inputs=0x3797d0, outputs=0x379950, host_allocator=...) at /home/petkantchin/ws/iree/repo/runtime/src/iree/vm/invocation.c:281
#22 0x000000000025b725 in iree::(anonymous namespace)::Run (out_exit_code=0x7fffffffd79c) at /home/petkantchin/ws/iree/repo/tools/iree-run-module-main.cc:142
#23 0x000000000025ad12 in main (argc=1, argv=0x7fffffffd8a8) at /home/petkantchin/ws/iree/repo/tools/iree-run-module-main.cc:212

Steps to reproduce your issue

1. Compilation command

        iree-compile \
            mnist_train.mlir \
            --iree-input-type=mhlo \
            --iree-hal-target-backends=llvm-cpu \
            -o mnist_train.vmfb

2. Run command

        iree-run-module \
            --module_file=mnist_train.vmfb \
            --device=local-sync \
            --entry_function=initialize \
            "--function_input=2xui32=[232, 843]"
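
For anyone retracing this, the two steps can be wrapped in a small script that also captures a backtrace non-interactively. This is a minimal sketch, not something from the report: it assumes `gdb` is installed and that `iree-compile` and `iree-run-module` are on `PATH`, and `run_or_print`, `compile_module`, and `run_under_gdb` are hypothetical helper names. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: reproduce the crash and capture a backtrace without an interactive
# gdb session. DRY_RUN=1 (the default here) only prints the commands; set
# DRY_RUN=0 to actually execute them.
set -eu

# Hypothetical helper, not part of IREE: print or execute a command.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: compile the module with the flags given in the report.
compile_module() {
  run_or_print iree-compile \
    mnist_train.mlir \
    --iree-input-type=mhlo \
    --iree-hal-target-backends=llvm-cpu \
    -o mnist_train.vmfb
}

# Step 2: run under gdb in batch mode. "run" starts the program and "bt"
# prints the stack once gdb stops on the SIGSEGV.
run_under_gdb() {
  run_or_print gdb -batch -ex run -ex bt --args \
    iree-run-module \
    --module_file=mnist_train.vmfb \
    --device=local-sync \
    --entry_function=initialize \
    "--function_input=2xui32=[232, 843]"
}

compile_module
run_under_gdb
```

Running it with `DRY_RUN=0` executes both steps. The input argument stays quoted because it contains a space.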



What component(s) does this issue relate to?

Runtime

Version information

Git commit e5d71f5a0b2635cfaf5153364cc699434ac92eb7, built in debug mode.

Additional context

If anyone is curious, this model comes from the iree-jax MNIST training example (https://github.com/iree-org/iree-jax/blob/5d171a3f9b69c5fdf3d8a7c5d296867026b4d87d/examples/mnist_export.py).

benvanik commented 1 year ago

This is likely a codegen issue. You can try debugging it by enabling ASAN; our docs seem to be rather poor on that but here's the gist:

benvanik commented 1 year ago

(The docs are at https://github.com/iree-org/iree/blob/67c04b9de75f40b0ad83b949fd0f5c92a6b74dd7/docs/developers/developing_iree/sanitizers.md, but the ASAN section doesn't mention the iree-compile flags, and the TSAN section, while similar, is focused on internal tests rather than on what you do when compiling your own programs.)

sogartar commented 1 year ago

I tried compiling the module with

iree-compile \
    mnist_train.mlir \
    --iree-input-type=mhlo \
    --iree-hal-target-backends=llvm-cpu \
    --iree-llvm-sanitize=address \
    --iree-llvm-link-embedded=false \
    -o mnist_train.vmfb

I got this error:

ld.lld: error: relocation R_X86_64_PC32 cannot be used against symbol __asan_option_detect_stack_use_after_return; recompile with -fPIC
>>> defined in /tmp/mnist_module_linked_llvm_cpu-e64538.o
>>> referenced by mnist_module_linked_llvm_cpu
>>>               /tmp/mnist_module_linked_llvm_cpu-e64538.o:(update_dispatch_18_matmul_1024x784x128)
Linking failed; escaped command line returned exit code 256:

/usr/local/bin/ld.lld -o /tmp/mnist_module_linked_llvm_cpu-e64538.so -nostdlib -static -shared /tmp/mnist_module_linked_llvm_cpu-e64538.o

I think I need to pass -fPIC when the module object file is compiled.

sogartar commented 1 year ago

It seems that PIC is already specified: https://github.com/iree-org/iree/blob/e5d71f5a0b2635cfaf5153364cc699434ac92eb7/compiler/src/iree/compiler/Dialect/HAL/Target/LLVM/LLVMIRPasses.cpp#L39

MaheshRavishankar commented 1 year ago

That seems to be a separate issue. Can you try dropping --iree-llvm-link-embedded=false?

sogartar commented 1 year ago

Without it I get:

lld: error: undefined symbol: __asan_stack_free_8
>>> referenced by mnist_module_linked_llvm_cpu
>>>               /tmp/mnist_module_linked_llvm_cpu-ab1148.o:(update_dispatch_18_matmul_1024x784x128)
Linking failed; escaped command line returned exit code 256:

LLD_VERSION=IREE /home/petkantchin/ws/iree/build/ninja/Debug/third_party/llvm-project/llvm/bin/lld -flavor gnu -o /tmp/mnist_module_linked_llvm_cpu-ab1148.so --build-id=none -nostdlib -static -shared --no-undefined --no-allow-shlib-undefined --allow-multiple-definition --gc-sections -z now -z relro --discard-all --icf=all --ignore-data-address-equality --ignore-function-address-equality --hash-style=sysv /tmp/mnist_module_linked_llvm_cpu-ab1148.o

I think it needs to link to libasan.
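
One way to see how much of the ASan runtime the object expects is to list its undefined `__asan_` symbols. The following is a rough sketch under assumptions: `filter_asan_symbols` is a hypothetical helper, the sample input only imitates the shape of `nm -u` output, and `some_other_symbol` is a made-up placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not part of IREE or LLVM): reads `nm -u` style lines on
# stdin and prints the unique symbol names that start with __asan_.
filter_asan_symbols() {
  awk '{ print $NF }' | grep '^__asan_' | sort -u
}

# Illustrative input shaped like `nm -u` output. __asan_stack_free_8 is the
# symbol named in the linker error; some_other_symbol is a placeholder.
printf '%s\n' \
  '                 U __asan_stack_free_8' \
  '                 U some_other_symbol' \
  '                 U __asan_stack_free_8' |
  filter_asan_symbols
```

Against a real object file, the input would come from something like `nm -u path/to/module.o | filter_asan_symbols`, where the path is a placeholder for wherever the object file was kept.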

sogartar commented 1 year ago

I am closing this. It seems it got fixed somehow. I don't see this problem on 8bedd4b249caa06ed346557214c016fe9b685dbf.

benvanik commented 1 year ago

This was likely related to the stack issues that we've recently been squashing. Glad it's working now - please let us know if you hit more issues of this type and we can dig in!