rocm-5.2.0: onnxruntime_test_all failed with "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"

Sfinx commented 2 years ago

Describe the bug

onnxruntime_test_all failed with "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" with v1.12.1 (70481649e3c2dba) and rocm-5.2.0

Urgency

medium

System information

Linux Ubuntu 22.04
ONNX Runtime installed from git (70481649e3c2dba)
ONNX Runtime version: v1.12.1
Python version: 3.10
GCC/Compiler version (if compiling from source): gcc version 12.0.1 20220319 (experimental) [master r12-7719-g8ca61ad148f] (Ubuntu 12-20220319-1ubuntu1)
AMD ROCm version: 5.2.0-65
GPU model and memory: gfx900

To Reproduce

Build ONNX Runtime with ROCm 5.2.0
Run the onnxruntime_test_all binary

Expected behavior

Tests should pass

./onnxruntime_test_all
....
[       OK ] GraphTransformationTests.ComputationReductionTransformer_GatherND_MatMul (0 ms)
[ RUN      ] GraphTransformationTests.ComputationReductionTransformer_GatherND_E2E
2022-08-20 08:21:28.812582409 [W:onnxruntime:Default, computation_reduction.cc:284 ApplyImpl] node gathernd_1 up across node add_2

2022-08-20 08:21:28.812598680 [W:onnxruntime:Default, computation_reduction.cc:284 ApplyImpl] node gathernd_1 up across node matmul_2

2022-08-20 08:21:28.812609159 [W:onnxruntime:Default, computation_reduction.cc:284 ApplyImpl] node gathernd_1 up across node layer_norm_2

2022-08-20 08:21:28.812619178 [W:onnxruntime:Default, computation_reduction.cc:284 ApplyImpl] node gathernd_1 up across node gelu_1

2022-08-20 08:21:28.812628797 [W:onnxruntime:Default, computation_reduction.cc:284 ApplyImpl] node gathernd_1 up across node add_1

2022-08-20 08:21:28.812638365 [W:onnxruntime:Default, computation_reduction.cc:284 ApplyImpl] node gathernd_1 up across node matmul_1

2022-08-20 08:21:28.812645578 [W:onnxruntime:Default, computation_reduction.cc:277 ApplyImpl] node gathernd_1 stopped at node layer_norm_1
2022-08-20 08:21:28.812752170 [W:onnxruntime:Default, computation_reduction.cc:277 ApplyImpl] node gathernd_1 stopped at node layer_norm_1
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)

Sfinx commented 2 years ago

Checked with recent rocm-5.2.3 - the same error

baijumeswani commented 2 years ago

After syncing internally, I found that this error typically indicates a configuration issue. For example, was the onnxruntime package built for one specific architecture and executed on another?

Sfinx commented 2 years ago

No, my setup is AMD GPU only. ROCm can run on either the CPU or the GPU, but I did not find an option for the onnxruntime_test_all binary to run it on CPU only, so I can't test even that.

ytaous commented 2 years ago

Hi, which Docker image are you using? Can you please try:

https://github.com/microsoft/onnxruntime/blob/cb2601c5ea54f491bc8934d401fe041f498bf132/tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile#L1
sample build cmd - https://github.com/microsoft/onnxruntime/blob/cb2601c5ea54f491bc8934d401fe041f498bf132/tools/ci_build/github/azure-pipelines/orttraining-pai-ci-pipeline.yml#L45
then run ./onnxruntime_test_all again?

Sfinx commented 2 years ago

Hi,

I do not use a Docker image; I build directly on Ubuntu 22.04.1.

I tried this exact Docker image (rocm/pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0). Results:

I launched the Docker image with '--device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video' for further testing.
The build fails in three places because of missing includes, so I had to fix them:

/root/onnxruntime/onnxruntime/core/providers/shared_library/provider_bridge_provider.cc:41:10: fatal error: orttraining/training_ops/cpu/controlflow/group.h: No such file or directory
   41 | #include "orttraining/training_ops/cpu/controlflow/group.h"

In file included from /root/onnxruntime/include/onnxruntime/orttraining/core/framework/torch/torch_proxy.h:6,
                 from /root/onnxruntime/include/onnxruntime/orttraining/training_ops/cpu/torch/torch_custom_function_kernel_base.h:16,
                 from /root/onnxruntime/onnxruntime/core/providers/shared_library/provider_bridge_provider.cc:46:
/root/onnxruntime/include/onnxruntime/orttraining/core/framework/torch/python_common.h:16:10: fatal error: Python.h: No such file or directory
   16 | #include <Python.h>

In file included from /usr/include/python3.8/pystate.h:129,
                 from /usr/include/python3.8/genobject.h:11,
                 from /usr/include/python3.8/Python.h:121,
                 from /root/onnxruntime/include/onnxruntime/orttraining/core/framework/torch/python_common.h:16,
                 from /root/onnxruntime/include/onnxruntime/orttraining/core/framework/torch/custom_function_register.h:5,
                 from /root/onnxruntime/orttraining/orttraining/core/framework/torch/custom_function_register.cc:4:
/usr/include/python3.8/cpython/pystate.h:9:10: fatal error: cpython/initconfig.h: No such file or directory
    9 | #include "cpython/initconfig.h"

and the test failed in the same way with "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"

rocminfo output from inside docker:

root@819d81728758:/var/lib/jenkins# rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700 Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 2700 Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3200                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    49245784(0x2ef6e58) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49245784(0x2ef6e58) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49245784(0x2ef6e58) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-0215236b5e301904               
  Marketing Name:          687F:C7                            
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 26751(0x687f)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1301                               
  BDFID:                   2304                               
  Internal Node ID:        1                                  
  Compute Unit:            56                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx900:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             
root@819d81728758:/var/lib/jenkins#

I can provide any additional info if needed.

ytaous commented 2 years ago

Hi, what's your VM SKU? Is it MI100 or MI250X (MI200)?

Sfinx commented 2 years ago

It is not a VM; it is a laptop with a Vega64 (gfx900). I posted the rocminfo output above.
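For readers who want to check this kind of mismatch themselves, here is a minimal Python sketch that pulls the GPU architecture names out of rocminfo text and compares them with a list of compiled targets. The function names and the `BUILT_TARGETS` set are illustrative assumptions, not part of onnxruntime or ROCm; the target list is copied from the hardcoded flags shown in the patch later in this thread, and the sample text is a three-line excerpt of the rocminfo output above.

```python
import re

# Assumption: the targets hardcoded in cmake/onnxruntime_providers.cmake when
# ROCM_VERSION_DEV_INT >= 50000, as shown in the patch later in this thread.
BUILT_TARGETS = {"gfx906", "gfx908", "gfx90a", "gfx1030"}


def gpu_archs_from_rocminfo(text):
    """Return gfx names found on agent 'Name:' lines, in first-seen order."""
    archs = []
    for line in text.splitlines():
        # Matches '  Name:   gfx900' but not the ISA line
        # '  Name:   amdgcn-amd-amdhsa--gfx900:xnack-'.
        m = re.match(r"\s*Name:\s+(gfx[0-9a-f]+)\s*$", line)
        if m and m.group(1) not in archs:
            archs.append(m.group(1))
    return archs


def missing_archs(device_archs, built_targets):
    """Return the device architectures that have no compiled target."""
    return [a for a in device_archs if a not in built_targets]


# Excerpt of the rocminfo output posted above.
SAMPLE = """\
  Name:                    AMD Ryzen 7 2700 Eight-Core Processor
  Name:                    gfx900
      Name:                    amdgcn-amd-amdhsa--gfx900:xnack-
"""

archs = gpu_archs_from_rocminfo(SAMPLE)
print(archs)                                # ['gfx900']
print(missing_archs(archs, BUILT_TARGETS))  # ['gfx900']
```

A non-empty result from `missing_archs` would be consistent with the "configuration issue" explanation given earlier: the binary carries no code object for the device it runs on.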

ytaous commented 2 years ago

Thanks. Unfortunately, this issue is not easy to debug, as it's typically platform/environment specific. I checked a few internal discussions. The tool "roc-obj-ls" is recommended for finding where this message comes from, i.e., which .so does not support the target GPU gfx900: https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/obj_tooling.md

I also checked our build machine and dev machine: they are gfx908/gfx90a, FYI.
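The check suggested above (finding which shared library lacks a code object for gfx900) can be sketched roughly as follows. This assumes the code-object listing for each library has already been captured as text and that it names targets with ISA triples of the form `amdgcn-amd-amdhsa--gfxNNN`, as rocminfo does above. The exact output format of the listing tool is not verified here, and the library names and listings below are made up for illustration.

```python
import re

# ISA triple as printed by rocminfo above, e.g. amdgcn-amd-amdhsa--gfx900:xnack-
ISA_TRIPLE = re.compile(r"amdgcn-amd-amdhsa--(gfx[0-9a-f]+)")


def targets_in_listing(listing):
    """Collect the gfx targets named by ISA triples anywhere in the text."""
    return set(ISA_TRIPLE.findall(listing))


def libraries_missing_target(listings, device_arch):
    """Return the names of libraries whose listing lacks device_arch, sorted."""
    return sorted(
        name
        for name, text in listings.items()
        if device_arch not in targets_in_listing(text)
    )


# Hypothetical listings keyed by hypothetical library names.
LISTINGS = {
    "lib_a.so": "amdgcn-amd-amdhsa--gfx906\namdgcn-amd-amdhsa--gfx908\n",
    "lib_b.so": "amdgcn-amd-amdhsa--gfx900:xnack-\n",
}

print(libraries_missing_target(LISTINGS, "gfx900"))  # ['lib_a.so']
```

Any library reported by `libraries_missing_target` would be a candidate source of the hipErrorNoBinaryForGpu message on a gfx900 device.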

Sfinx commented 2 years ago

I do not see any reason why I should be limited to devices hardcoded by the onnxruntime developers when ROCm supports all of them. It seems onnxruntime is unusable for me.

Sfinx commented 2 years ago

This patch is for those who do not want to be limited to the devices predefined by m$. It works for me:

diff --git a/cmake/onnxruntime_kernel_explorer.cmake b/cmake/onnxruntime_kernel_explorer.cmake
index ad91139..93c39fe 100644
--- a/cmake/onnxruntime_kernel_explorer.cmake
+++ b/cmake/onnxruntime_kernel_explorer.cmake
@@ -45,7 +45,7 @@ target_compile_definitions(kernel_explorer
 # handle kernel_explorer sources as hip language
 target_compile_options(kernel_explorer PRIVATE "-xhip")
 # TODO: use predefined AMDGPU_TARGETS
-target_compile_options(kernel_explorer PRIVATE "--offload-arch=gfx908" "--offload-arch=gfx90a")
+# target_compile_options(kernel_explorer PRIVATE "--offload-arch=gfx908" "--offload-arch=gfx90a")
 # https://github.com/ROCm-Developer-Tools/HIP/blob/4514f350849b1090954295f8f87a5f8d78bd781b/hip-lang-config.cmake.in
 target_link_libraries(kernel_explorer PRIVATE ${CLANGRT_BUILTINS})

diff --git a/cmake/onnxruntime_providers.cmake b/cmake/onnxruntime_providers.cmake
index d4d1fd7..d383e71 100644
--- a/cmake/onnxruntime_providers.cmake
+++ b/cmake/onnxruntime_providers.cmake
@@ -1418,10 +1418,10 @@ if (onnxruntime_USE_ROCM)
   list(APPEND HIP_CLANG_FLAGS -fno-gpu-rdc)

   # Generate GPU code for GFX9 Generation
-  list(APPEND HIP_CLANG_FLAGS --amdgpu-target=gfx906 --amdgpu-target=gfx908)
-  if (ROCM_VERSION_DEV_INT GREATER_EQUAL 50000)
-    list(APPEND HIP_CLANG_FLAGS --amdgpu-target=gfx90a --amdgpu-target=gfx1030)
-  endif()
+  #list(APPEND HIP_CLANG_FLAGS --amdgpu-target=gfx906 --amdgpu-target=gfx908)
+  #if (ROCM_VERSION_DEV_INT GREATER_EQUAL 50000)
+  #  list(APPEND HIP_CLANG_FLAGS --amdgpu-target=gfx90a --amdgpu-target=gfx1030)
+  #endif()

   #onnxruntime_add_shared_library_module(onnxruntime_providers_rocm ${onnxruntime_providers_rocm_src})
   hip_add_library(onnxruntime_providers_rocm MODULE ${onnxruntime_providers_rocm_src})
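The patch above simply comments out the hardcoded target flags, presumably leaving target selection to the toolchain's defaults. An alternative, in the spirit of the "TODO: use predefined AMDGPU_TARGETS" comment in the diff, would be to derive the flags from a configurable target list. The mapping is sketched below in Python for illustration only; `offload_arch_flags` is a hypothetical helper, not something that exists in the onnxruntime build, and the `--offload-arch=` spelling is taken from the kernel_explorer line in the diff.

```python
def offload_arch_flags(targets):
    """Build one --offload-arch flag per unique target, keeping input order."""
    seen = []
    for target in targets:
        if target not in seen:
            seen.append(target)
    return [f"--offload-arch={target}" for target in seen]


# A single-device setup such as the gfx900 laptop in this thread.
print(offload_arch_flags(["gfx900"]))
# ['--offload-arch=gfx900']

# Duplicates are dropped, so merging several target lists is harmless.
print(" ".join(offload_arch_flags(["gfx908", "gfx90a", "gfx908"])))
# --offload-arch=gfx908 --offload-arch=gfx90a
```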

microsoft / onnxruntime

rocm-5.2.0: onnxruntime_test_all failed with "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" #12662