nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

IR with multiple functions failing in compiler #130

Status: Closed (nirvedhmeshram closed 7 months ago)

nirvedhmeshram commented 8 months ago

With an IR like this

func.func @matmul_8x32_16xi32_(%lhs: tensor<8x16xi32>, %rhs: tensor<16x32xi32>) -> tensor<8x32xi32> {
  %init_acc = tensor.empty() : tensor<8x32xi32>
  %c0_acc_type = arith.constant 0: i32
  %acc = linalg.fill ins(%c0_acc_type : i32) outs(%init_acc : tensor<8x32xi32>) -> tensor<8x32xi32>
  %result = linalg.matmul ins(%lhs, %rhs: tensor<8x16xi32>, tensor<16x32xi32>) outs(%acc: tensor<8x32xi32>) -> tensor<8x32xi32>
  return %result: tensor<8x32xi32>
}

func.func @matmul_8x16_16xi32_(%lhs: tensor<8x16xi32>, %rhs: tensor<16x16xi32>) -> tensor<8x16xi32> {
  %init_acc = tensor.empty() : tensor<8x16xi32>
  %c0_acc_type = arith.constant 0: i32
  %acc = linalg.fill ins(%c0_acc_type : i32) outs(%init_acc : tensor<8x16xi32>) -> tensor<8x16xi32>
  %result = linalg.matmul ins(%lhs, %rhs: tensor<8x16xi32>, tensor<16x16xi32>) outs(%acc: tensor<8x16xi32>) -> tensor<8x16xi32>
  return %result: tensor<8x16xi32>
}

The following compiler command sometimes works, but most of the time it crashes, and the segfault is not the same each time either:

./tools/iree-compile build-matmul/matmul_i32_i32_small_amd-aie_xrt_matmuls.mlir --iree-hal-target-backends=amd-aie --iree-amd-aie-peano-install-dir=<path to peano> --iree-amd-aie-mlir-aie-install-dir=<path to mlir-aie> --iree-amd-aie-vitis-install-dir=<path to vitis> -o test.vmfb

Here are some of the segfaults

Note that if either function is compiled by itself, there are no problems.

nirvedhmeshram commented 8 months ago

fyi @MaheshRavishankar @yzhang93 @Abhishek-Varma Also, @ScottTodd @stellaraccident we were wondering if the way the plugin is added could have anything to do with this?

stellaraccident commented 8 months ago

Sounds like you have function scoped passes that are mutating IR outside of their scope. Run with threading disabled to verify the hypothesis.

nirvedhmeshram commented 8 months ago

Yes, it is not crashing with -mlir-disable-threading. I will work on narrowing down which passes are the culprits. @Abhishek-Varma maybe we can team up on this.

stellaraccident commented 8 months ago

Likely if you run the compiler built with tsan, it will pinpoint for you.

Abhishek-Varma commented 8 months ago

I first tried the normal compilation flow which we're using for e2e lit tests - no issues there. It passes.

Then I switched over to the above compilation command with Peano/Vitis/etc. The failure does indeed occur at different points, but in my case I get verification errors rather than the stack trace above, so I am not able to reproduce that trace.

I'm building IREE with TSan on xcorad. The build itself is taking time (still at [196/5206]) and has multiple FAILED: instances (for `llvm-project/lib/Target/AMDGPU/`).

Is there any "faster" way to debug this? @stellaraccident @MaheshRavishankar

newling commented 8 months ago

I get UB on my local machine with iree-compile repro.mlir --iree-hal-target-backends=amd-aie -o test.vmfb. With threading disabled, I consistently get repro.mlir:13:13: error: failed to serialize executables.

I can also get UB (again resolved by disabling threading) with these 2 steps:

iree-compile --iree-hal-target-backends=amd-aie --compile-to=executable-sources repro.mlir  > part1.mlir 
iree-opt part1.mlir  --iree-amdaie-use-pipeline=simple-pack --pass-pipeline="builtin.module(hal.executable(hal.executable.variant(iree-hal-translate-target-executable-variants{target=amd-aie})))"

My IREE build has CMAKE_BUILD_TYPE:STRING=RelWithDebInfo and IREE_ENABLE_ASSERTIONS:BOOL=ON

@Abhishek-Varma ping me if you'd like me to try something.

nirvedhmeshram commented 8 months ago

I was hit with the same build error Abhishek got for tsan (since we are using the same machine/compiler). @newling that error is a general high-level error you get when any of the passes fails. Are there any other errors along with it?

newling commented 8 months ago

@nirvedhmeshram The 4 scenarios described above are: {with threading, without threading} x {compile pipeline 1, compile pipeline 2}

where

compile pipeline 1:

iree-compile repro.mlir --iree-hal-target-backends=amd-aie -o test.vmfb

compile pipeline 2:

iree-compile --iree-hal-target-backends=amd-aie --compile-to=executable-sources repro.mlir  > part1.mlir 
iree-opt part1.mlir  --iree-amdaie-use-pipeline=simple-pack --pass-pipeline="builtin.module(hal.executable(hal.executable.variant(iree-hal-translate-target-executable-variants{target=amd-aie})))"

and threading is controlled with --mlir-disable-threading
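Because the failure is intermittent, one way to compare these scenarios is to rerun the same command several times and tally the exit statuses. The following is a minimal sketch, not something taken from the thread: the helper name `count_failures` and the run count of 20 are assumptions for illustration, and the commented commands reuse the flags quoted above.

```shell
# Run a command N times and report how many runs exited non-zero.
# Usage: count_failures N cmd [args...]
# (count_failures is a hypothetical helper, not part of IREE.)
count_failures() {
  cf_runs=$1
  shift
  cf_failures=0
  cf_i=0
  while [ "$cf_i" -lt "$cf_runs" ]; do
    # Discard output; only the exit status matters here.
    # A crash (abort or segfault) also shows up as a non-zero status.
    if ! "$@" >/dev/null 2>&1; then
      cf_failures=$((cf_failures + 1))
    fi
    cf_i=$((cf_i + 1))
  done
  echo "${cf_failures}/${cf_runs} runs failed"
}

# Example invocations (the run count of 20 is arbitrary):
# count_failures 20 iree-compile repro.mlir --iree-hal-target-backends=amd-aie -o test.vmfb
# count_failures 20 iree-compile repro.mlir --iree-hal-target-backends=amd-aie --mlir-disable-threading -o test.vmfb
```

A tally like this only separates "fails sometimes" from "fails always" or "never fails"; it does not distinguish a crash from a clean compiler error, so the output of failing runs still needs to be inspected by hand.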

With threading, compile pipeline 1:

UB about 50% of the time. Examples:

Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
iree-compile: /home/jamesn/iree/third_party/llvm-project/llvm/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::StringAttr, From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
Aborted
Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
iree-compile: iree/third_party/llvm-project/mlir/include/mlir/IR/Operation.h:1001: detail::OpResultImpl *mlir::Operation::getOpResultImpl(unsigned int): Assertion `resultNumber < getNumResults() && "Result number is out of range for operation"' failed.
Aborted
iree-compile: /home/jamesn/iree/third_party/llvm-project/llvm/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::StringAttr, From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libIREECompiler.so 0x00007f23b0fabcc7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39
1  libIREECompiler.so 0x00007f23b0fa9ef0 llvm::sys::RunSignalHandlers() + 80
2  libIREECompiler.so 0x00007f23b0fac38a
3  libc.so.6          0x00007f23aaaff520
4  libc.so.6          0x00007f23aab539fc pthread_kill + 300
5  libc.so.6          0x00007f23aaaff476 raise + 22
6  libc.so.6          0x00007f23aaae57f3 abort + 211
7  libc.so.6          0x00007f23aaae571b
8  libc.so.6          0x00007f23aaaf6e96
9  libIREECompiler.so 0x00007f23b0ff79cc
10 libIREECompiler.so 0x00007f23b0efe400
11 libIREECompiler.so 0x00007f23b0fb4338
12 libIREECompiler.so 0x00007f23b10a4a7d
13 libIREECompiler.so 0x00007f23b10a52c0
14 libIREECompiler.so 0x00007f23b10a4933 mlir::verify(mlir::Operation*, bool) + 19
15 libIREECompiler.so 0x00007f23b1143d40 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 1328
16 libIREECompiler.so 0x00007f23b1144208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
17 libIREECompiler.so 0x00007f23b1149663
18 libIREECompiler.so 0x00007f23b114576b mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) + 2315
19 libIREECompiler.so 0x00007f23b1143c20 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 1040
20 libIREECompiler.so 0x00007f23b1144208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
21 libIREECompiler.so 0x00007f23b1148a31
22 libIREECompiler.so 0x00007f23b25b9452
23 libIREECompiler.so 0x00007f23b1143a85 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 629
24 libIREECompiler.so 0x00007f23b1144208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
25 libIREECompiler.so 0x00007f23b1149663
26 libIREECompiler.so 0x00007f23b114576b mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) + 2315
27 libIREECompiler.so 0x00007f23b1143c20 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 1040
28 libIREECompiler.so 0x00007f23b1144208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
29 libIREECompiler.so 0x00007f23b1148a31
30 libIREECompiler.so 0x00007f23b25b9f55
31 libIREECompiler.so 0x00007f23b1143a85 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 629
32 libIREECompiler.so 0x00007f23b1144208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
33 libIREECompiler.so 0x00007f23b1149663
34 libIREECompiler.so 0x00007f23b114971f
35 libIREECompiler.so 0x00007f23b10a7476
36 libIREECompiler.so 0x00007f23b0f69d31 llvm::ThreadPool::processTasks(llvm::ThreadPoolTaskGroup*) + 993
37 libIREECompiler.so 0x00007f23b0f6a9dc
38 libc.so.6          0x00007f23aab51ac3
39 libc.so.6          0x00007f23aabe3850
Aborted

When there is an actual error thrown during MLIR lowering (the other 50% of the time) it always looks like

}) {sym_name = "amdaie_xclbin_fb", target = #hal.executable.target<"amd-aie", "amdaie-xclbin-fb", {target_arch = "chip-tbd"}>} : () -> ()
  %result = linalg.matmul ins(%lhs, %rhs: tensor<8x16xi32>, tensor<16x16xi32>) outs(%acc: tensor<8x16xi32>) -> tensor<8x16xi32>
            ^
issue_130_reproducer.mlir:13:13: error: failed to translate executables
  %result = linalg.matmul ins(%lhs, %rhs: tensor<8x16xi32>, tensor<16x16xi32>) outs(%acc: tensor<8x16xi32>) -> tensor<8x16xi32>
            ^
issue_130_reproducer.mlir:9:1: note: called from
func.func @matmul_8x16_16xi32_(%lhs: tensor<8x16xi32>, %rhs: tensor<16x16xi32>) -> tensor<8x16xi32> {
^
issue_130_reproducer.mlir:13:13: note: see current operation:
"hal.executable"() ({

Without threading, compile pipeline 1:

Always the same error as above (failed to translate executables).

With threading, compile pipeline 2:

UB about 50% of the time. Examples:

Segmentation fault
iree-opt: iree/third_party/llvm-project/mlir/include/mlir/IR/Operation.h:1001: detail::OpResultImpl *mlir::Operation::getOpResultImpl(unsigned int): Assertion `resultNumber < getNumResults() && "Result number is out of range for operation"' failed.
Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libIREECompiler.so 0x00007f9fc9229cc7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39
1  libIREECompiler.so 0x00007f9fc9227ef0 llvm::sys::RunSignalHandlers() + 80
2  libIREECompiler.so 0x00007f9fc922a38a
3  libc.so.6          0x00007f9fc2d7d520
4  libc.so.6          0x00007f9fc2dd19fc pthread_kill + 300
5  libc.so.6          0x00007f9fc2d7d476 raise + 22
6  libc.so.6          0x00007f9fc2d637f3 abort + 211
7  libc.so.6          0x00007f9fc2d6371b
8  libc.so.6          0x00007f9fc2d74e96
9  libIREECompiler.so 0x00007f9fca4cc1f4
10 libIREECompiler.so 0x00007f9fc917d009
11 libIREECompiler.so 0x00007f9fc917d009
12 libIREECompiler.so 0x00007f9fc917d009
13 libIREECompiler.so 0x00007f9fca4baf08
14 libIREECompiler.so 0x00007f9fc93c1a85 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 629
15 libIREECompiler.so 0x00007f9fc93c2208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
16 libIREECompiler.so 0x00007f9fc93c7663
17 libIREECompiler.so 0x00007f9fc93c376b mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) + 2315
18 libIREECompiler.so 0x00007f9fc93c1c20 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 1040
19 libIREECompiler.so 0x00007f9fc93c2208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
20 libIREECompiler.so 0x00007f9fc93c6a31
21 libIREECompiler.so 0x00007f9fca837452
22 libIREECompiler.so 0x00007f9fc93c1a85 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 629
23 libIREECompiler.so 0x00007f9fc93c2208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
24 libIREECompiler.so 0x00007f9fc93c7663
25 libIREECompiler.so 0x00007f9fc93c376b mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) + 2315
26 libIREECompiler.so 0x00007f9fc93c1c20 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 1040
27 libIREECompiler.so 0x00007f9fc93c2208 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 328
28 libIREECompiler.so 0x00007f9fc93c7663
29 libIREECompiler.so 0x00007f9fc93c771f
30 libIREECompiler.so 0x00007f9fc9325476
31 libIREECompiler.so 0x00007f9fc91e7d31 llvm::ThreadPool::processTasks(llvm::ThreadPoolTaskGroup*) + 993
32 libIREECompiler.so 0x00007f9fc91e89dc
33 libc.so.6          0x00007f9fc2dcfac3
34 libc.so.6          0x00007f9fc2e61850
Aborted

Without threading, compile pipeline 2:

Compiles perfectly (no UB, no error).

nirvedhmeshram commented 8 months ago

@newling thanks for the detailed scenarios. Is this with tsan enabled? I am seeing some of the same kinds of errors without tsan. By the way, you can run ninja llvm-symbolizer in the iree-build directory and then export LLVM_SYMBOLIZER_PATH=<path to iree build>/llvm-project/bin/llvm-symbolizer to see symbols in your traces (although that does not help much with this specific failure).

newling commented 8 months ago

@nirvedhmeshram I haven't built with tsan yet, I'll start that now (on a more powerful machine...)

newling commented 8 months ago

I resolved the 3 warnings which tsan threw up: https://github.com/Xilinx/mlir-air/pull/410

But I'm still seeing the UB. So to summarize:

1) no UB with mlir-disable-threading
2) UB without mlir-disable-threading
3) no longer any warnings with tsan with mlir-air #410

@nirvedhmeshram / @stellaraccident if there's something you'd try next please let me know and I'll try it in the morning

newling commented 8 months ago

Bisection search by commenting out passes suggests a segfault is caused by passManager.addPass(xilinx::air::createAIRDependencyPass());
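A bisection like this can be organized as a binary search over the number of passes left enabled. The sketch below is illustrative only and is not the procedure used in the thread: `first_bad` and the probe command are hypothetical, and it assumes the probe takes a count K, succeeds when only the first K passes are enabled and the pipeline is healthy, and fails once the culprit pass is included.

```shell
# Binary search for the smallest K such that enabling the first K passes fails.
# Usage: first_bad N probe [args...]
# Assumptions (hypothetical): "probe 0" would succeed, "probe N" fails, and
# failures are monotonic in K.
first_bad() {
  fb_hi=$1   # invariant: enabling fb_hi passes fails
  shift
  fb_lo=0    # invariant: enabling fb_lo passes succeeds
  while [ $((fb_hi - fb_lo)) -gt 1 ]; do
    fb_mid=$(( (fb_lo + fb_hi) / 2 ))
    if "$@" "$fb_mid" >/dev/null 2>&1; then
      fb_lo=$fb_mid
    else
      fb_hi=$fb_mid
    fi
  done
  # fb_hi is now the 1-based index of the first pass whose inclusion fails.
  echo "$fb_hi"
}
```

For an intermittent crash such as this one, a single run per probe is unreliable, so the probe would itself need to repeat the compile several times and report failure if any run crashes.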

newling commented 8 months ago

PR https://github.com/Xilinx/mlir-air/pull/411 gets past the air dependency pass

MaheshRavishankar commented 7 months ago

Is this fixed now?

nirvedhmeshram commented 7 months ago

Yes, I am able to compile now. There is a runtime issue I am seeing when multiple dispatches are present, but I will open a separate issue for that after proving it in CI.