llvm / llvm-project

MLIR rocm tests (two) failure on AMD RX 7900 XTX #63189

Open bondhugula opened 1 year ago

bondhugula commented 1 year ago

The latest official Git version, at 18cc07aa07f6784cc59a4b4cfe33522867805586 (Jun 8), has two ROCm tests in the check-mlir suite failing on a recent AMD Radeon GPU, the RX 7900 XTX (gfx1100), with ROCm 5.4.3. The remaining four tests in Integration/GPU/ROCM/ pass.

********************
FAIL: MLIR :: Integration/GPU/ROCM/vector-transferops.mlir (3 of 6)
******************** TEST 'MLIR :: Integration/GPU/ROCM/vector-transferops.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/uday/llvm-project-upstream/build/bin/mlir-opt /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/vector-transferops.mlir  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -convert-scf-to-cf  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -gpu-kernel-outlining  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-rocdl{chipset=gfx1100 index-bitwidth=32},gpu-to-hsaco{chip=gfx1100}))'  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -gpu-to-llvm  | /home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner    --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so    --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_runner_utils.so    --entry-point-result=void  | /home/uday/llvm-project-upstream/build/bin/FileCheck /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/vector-transferops.mlir
--
Exit Code: 1

Command Output (stderr):
--
/home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/vector-transferops.mlir:79:12: error: CHECK: expected string not found in input
 // CHECK: [1.23, 2.46, 2.46, 1.23]
           ^
<stdin>:1:1: note: scanning from here
Unranked Memref base@ = 0x562c99d56c80 rank = 1 offset = 0 sizes = [4] strides = [1] data = 
^
<stdin>:2:1: note: possible intended match here
[1.23, 1.23, 1.23, 1.23]
^

Input file: <stdin>
Check file: /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/vector-transferops.mlir

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            1: Unranked Memref base@ = 0x562c99d56c80 rank = 1 offset = 0 sizes = [4] strides = [1] data =  
check:79'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
            2: [1.23, 1.23, 1.23, 1.23] 
check:79'0     ~~~~~~~~~~~~~~~~~~~~~~~~~
check:79'1     ?                         possible intended match
            3: Unranked Memref base@ = 0x562c99d56c80 rank = 1 offset = 0 sizes = [4] strides = [1] data =  
check:79'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            4: [1.23, 1.23, 1.23, 1.23] 
check:79'0     ~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>

--

-- Testing: 6 of 2061 tests, 6 workers --
FAIL: MLIR :: Integration/GPU/ROCM/printf.mlir (6 of 6)
******************** TEST 'MLIR :: Integration/GPU/ROCM/printf.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/uday/llvm-project-upstream/build/bin/mlir-opt /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/printf.mlir  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-rocdl{index-bitwidth=32 runtime=HIP},gpu-to-hsaco{chip=gfx1100}))'  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -gpu-to-llvm  | /home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner    --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so    --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_runner_utils.so    --entry-point-result=void  | /home/uday/llvm-project-upstream/build/bin/FileCheck /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/printf.mlir
--
Exit Code: 2

Command Output (stderr):
--
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: /home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_runner_utils.so --entry-point-result=void
 #0 0x0000561558409860 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0x9a4860)
 #1 0x0000561558406c44 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f8ad5a42520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x00007f8acd91fb68 (/opt/rocm/lib/libamdhip64.so.5+0x31fb68)
 #4 0x00007f8acd94435b (/opt/rocm/lib/libamdhip64.so.5+0x34435b)
 #5 0x00007f8acd94cdad (/opt/rocm/lib/libamdhip64.so.5+0x34cdad)
 #6 0x00007f8acd94d124 (/opt/rocm/lib/libamdhip64.so.5+0x34d124)
 #7 0x00007f8acd912444 (/opt/rocm/lib/libamdhip64.so.5+0x312444)
 #8 0x00007f8acd8055d0 (/opt/rocm/lib/libamdhip64.so.5+0x2055d0)
 #9 0x00007f8acd813369 hipModuleLaunchKernel (/opt/rocm/lib/libamdhip64.so.5+0x213369)
#10 0x00007f8ad40db874 mgpuLaunchKernel (/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so+0x39874)
#11 0x00007f8ad61a808e 
#12 0x00007f8ad61a80e1 
#13 0x00005615589ef37c compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#14 0x00005615589efa27 compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#15 0x00005615589ed7a5 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0xf887a5)
#16 0x000056155835742b main (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0x8f242b)
#17 0x00007f8ad5a29d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#18 0x00007f8ad5a29e40 call_init ./csu/../csu/libc-start.c:128:20
#19 0x00007f8ad5a29e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#20 0x00005615583e8fc5 _start (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0x983fc5)
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/uday/llvm-project-upstream/build/bin/FileCheck /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/printf.mlir

--

********************
********************
Failed Tests (2):
  MLIR :: Integration/GPU/ROCM/printf.mlir
  MLIR :: Integration/GPU/ROCM/vector-transferops.mlir

Testing Time: 159.54s
  Excluded: 1633
  Passed  :    4
  Failed  :    2
FAILED: tools/mlir/test/CMakeFiles/check-mlir /home/uday/llvm-project-upstream/build/tools/mlir/test/CMakeFiles/check-mlir 
cd /home/uday/llvm-project-upstream/build/tools/mlir/test && /usr/bin/python3.10 /home/uday/llvm-project-upstream/build/./bin/llvm-lit -sv /home/uday/llvm-project-upstream/build/tools/mlir/test
ninja: build stopped: subcommand failed.
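
For anyone trying to reproduce this, the log above shows llvm-lit being run with -sv over the MLIR test directory. A minimal sketch for running only the ROCM integration tests follows. It assumes lit's --filter option (a regular expression matched against test names) and a build tree at the path seen in the log; BUILD_DIR is a placeholder to adjust, not something taken from the report.

```shell
# BUILD_DIR is an assumed location; override it to point at your own build tree.
BUILD_DIR="${BUILD_DIR:-$HOME/llvm-project-upstream/build}"
LIT="$BUILD_DIR/bin/llvm-lit"

# Run only tests whose name matches the ROCM integration directory.
if [ -x "$LIT" ]; then
  "$LIT" -sv "$BUILD_DIR/tools/mlir/test" --filter 'Integration/GPU/ROCM'
else
  echo "llvm-lit not found at $LIT" >&2
fi
```

This keeps the run to the six ROCM tests instead of the whole check-mlir suite.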

Tagging the authors based on the ChangeLog here.

CC: @krzysz00 @jerryyin

llvmbot commented 1 year ago

@llvm/issue-subscribers-mlir

llvmbot commented 1 year ago

@llvm/issue-subscribers-mlir-gpu

krzysz00 commented 1 year ago

It looks like my attempt to fix the commit message didn't go through, so noting it here: this is closed by 20c66a0c66340f.

bondhugula commented 1 year ago

Thanks. The commit resolves the first test case, but the ROCM printf.mlir test still fails for me at ed27d28f9a53d689c98a3bef26980e2858350548.

******************** TEST 'MLIR :: Integration/GPU/ROCM/printf.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/uday/llvm-project-upstream/build/bin/mlir-opt /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/printf.mlir  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-rocdl{index-bitwidth=32 runtime=HIP},gpu-to-hsaco{chip=gfx1100}))'  | /home/uday/llvm-project-upstream/build/bin/mlir-opt -gpu-to-llvm  | /home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner    --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so    --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_runner_utils.so    --entry-point-result=void  | /home/uday/llvm-project-upstream/build/bin/FileCheck /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/printf.mlir
--
Exit Code: 2

Command Output (stderr):
--
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: /home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so --shared-libs=/home/uday/llvm-project-upstream/build/lib/libmlir_runner_utils.so --entry-point-result=void
 #0 0x000055b4228b02c0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0x9bf2c0)
 #1 0x000055b4228ad6a4 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fd5e4842520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x00007fd5e091fb68 (/opt/rocm/lib/libamdhip64.so.5+0x31fb68)
 #4 0x00007fd5e094435b (/opt/rocm/lib/libamdhip64.so.5+0x34435b)
 #5 0x00007fd5e094cdad (/opt/rocm/lib/libamdhip64.so.5+0x34cdad)
 #6 0x00007fd5e094d124 (/opt/rocm/lib/libamdhip64.so.5+0x34d124)
 #7 0x00007fd5e0912444 (/opt/rocm/lib/libamdhip64.so.5+0x312444)
 #8 0x00007fd5e08055d0 (/opt/rocm/lib/libamdhip64.so.5+0x2055d0)
 #9 0x00007fd5e0813369 hipModuleLaunchKernel (/opt/rocm/lib/libamdhip64.so.5+0x213369)
#10 0x00007fd5e4ade874 mgpuLaunchKernel (/home/uday/llvm-project-upstream/build/lib/libmlir_rocm_runtime.so+0x39874)
#11 0x00007fd5e502e08e 
#12 0x00007fd5e502e0e1 
#13 0x000055b422e9bd8c compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#14 0x000055b422e9c437 compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#15 0x000055b422e9a1b5 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0xfa91b5)
#16 0x000055b4227fe273 main (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0x90d273)
#17 0x00007fd5e4829d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#18 0x00007fd5e4829e40 call_init ./csu/../csu/libc-start.c:128:20
#19 0x00007fd5e4829e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#20 0x000055b42288fa25 _start (/home/uday/llvm-project-upstream/build/bin/mlir-cpu-runner+0x99ea25)
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/uday/llvm-project-upstream/build/bin/FileCheck /home/uday/llvm-project-upstream/mlir/test/Integration/GPU/ROCM/printf.mlir

--

pcf000 commented 1 year ago

I bumped into the printf.mlir problem a little while ago on our buildbot, and officially it's fixed with ROCm 5.6.0, which should be available soon. When I started using a 5.6.0 pre-release with the buildbot, the test stopped failing.

I'll verify that this also fixes it on your card (the buildbot's is older) and find out whether a pre-release is available.

pcf000 commented 1 year ago

I haven't checked on the new card yet, but I do have a workaround: run export LIT_XFAIL='Integration/GPU/ROCM/printf.mlir' before running check-mlir. lit will then list the test as an expected failure. Once you install a version of ROCm that fixes the problem, the test will become an "unexpected pass", which will get your attention.
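
The workaround above could be scripted roughly as follows. This is only a sketch: it assumes a Ninja build (as in the original log) and a build tree at a placeholder BUILD_DIR, neither of which is prescribed by the comment itself.

```shell
# Mark the known-bad test as an expected failure for this shell session.
export LIT_XFAIL='Integration/GPU/ROCM/printf.mlir'

# BUILD_DIR is an assumed location; override it to point at your own build tree.
BUILD_DIR="${BUILD_DIR:-$HOME/llvm-project-upstream/build}"

# Only invoke the test target when the build tree actually exists.
if [ -d "$BUILD_DIR" ]; then
  ninja -C "$BUILD_DIR" check-mlir
else
  echo "build directory not found: $BUILD_DIR" >&2
fi
```

Because the variable is exported only in the current shell, opening a new shell (or running unset LIT_XFAIL) restores the default behaviour.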

pcf000 commented 1 year ago

@bondhugula, I tested on a card like yours, and indeed it's the same bug. The upcoming ROCm 5.6 will fix it, or the LIT_XFAIL suggestion above will work around it.
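
To decide which of the two options applies on a given machine, a small version check may help. The sketch below assumes the installed ROCm version can be read from /opt/rocm/.info/version (an assumption about the install layout, not something stated in this thread) and that GNU sort with -V is available; version_ge is a hypothetical helper name.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Assumed location of the ROCm version file; override if your install differs.
ROCM_VERSION_FILE="${ROCM_VERSION_FILE:-/opt/rocm/.info/version}"

if [ -r "$ROCM_VERSION_FILE" ]; then
  # Drop any build suffix after a dash, keeping only the dotted version.
  installed="$(cut -d- -f1 < "$ROCM_VERSION_FILE")"
  if version_ge "$installed" "5.6.0"; then
    echo "ROCm $installed: the printf.mlir fix should be included"
  else
    echo "ROCm $installed: consider the LIT_XFAIL workaround"
  fi
else
  echo "cannot read $ROCM_VERSION_FILE" >&2
fi
```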

pcf000 commented 1 year ago

@bondhugula, ROCm 5.6 is out. See https://rocm.docs.amd.com/en/latest/.

pcf000 commented 1 year ago

@bondhugula, ROCm 5.7 is out.