guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

CyberShadow commented 3 years ago

Hi and thank you for all your work in putting this together.

I'm running into an error when trying to run a simple Python program. After having built and installed tensorflow-rocm and dependencies, I'm trying the following simple Python script:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

It produces the following output:

2021-03-18 17:08:24.637639: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-18 17:08:24.639302: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-18 17:08:24.639375: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libamdhip64.so
/build/hip-rocclr/src/HIP-rocm-4.0.0/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
[1]    2964311 abort (core dumped)  python test.py

I don't know which binary it is referring to that it cannot find. Grepping strace output for ENOENT does not produce any illuminating results.

rocminfo output

```
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name: AMD Ryzen Threadripper 3960X 24-Core Processor
  Uuid: CPU-XX
  Marketing Name: AMD Ryzen Threadripper 3960X 24-Core Processor
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 3800
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 48
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 131776176(0x7dabeb0) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 131776176(0x7dabeb0) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
    N/A
*******
Agent 2
*******
  Name: gfx900
  Uuid: GPU-021502761a624124
  Marketing Name: Vega 10 XL/XT [Radeon RX Vega 56/64]
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 4096(0x1000)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
  Chip ID: 26751(0x687f)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 1630
  BDFID: 19712
  Internal Node ID: 1
  Compute Unit: 64
  SIMDs per CU: 4
  Shader Engines: 4
  Shader Arrs. per Eng.: 1
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: FALSE
  Wavefront Size: 64(0x40)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 40(0x28)
  Max Work-item Per CU: 2560(0xa00)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 8372224(0x7fc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx900
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*** Done ***
```

clinfo output

```
Number of platforms  1
  Platform Name  AMD Accelerated Parallel Processing
  Platform Vendor  Advanced Micro Devices, Inc.
  Platform Version  OpenCL 2.0 AMD-APP.dbg (3212.0)
  Platform Profile  FULL_PROFILE
  Platform Extensions  cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix  AMD

  Platform Name  AMD Accelerated Parallel Processing
Number of devices  1
  Device Name  gfx900
  Device Vendor  Advanced Micro Devices, Inc.
  Device Vendor ID  0x1002
  Device Version  OpenCL 2.0
  Driver Version  3212.0 (HSA1.1,LC)
  Device OpenCL C Version  OpenCL C 2.0
  Device Type  GPU
  Device Board Name (AMD)  Vega 10 XL/XT [Radeon RX Vega 56/64]
  Device PCI-e ID (AMD)  0x687f
  Device Topology (AMD)  PCI-E, 0000:4d:00.0
  Device Profile  FULL_PROFILE
  Device Available  Yes
  Compiler Available  Yes
  Linker Available  Yes
  Max compute units  64
  SIMD per compute unit (AMD)  4
  SIMD width (AMD)  16
  SIMD instruction width (AMD)  1
  Max clock frequency  1630MHz
  Graphics IP (AMD)  9.0
  Device Partition  (core)
    Max number of sub-devices  64
    Supported partition types  None
    Supported affinity domains  (n/a)
  Max work item dimensions  3
  Max work item sizes  1024x1024x1024
  Max work group size  256
  Preferred work group size (AMD)  256
  Max work group size (AMD)  1024
  Preferred work group size multiple (kernel)  64
  Wavefront width (AMD)  64
  Preferred / native vector sizes
    char  4 / 4
    short  2 / 2
    int  1 / 1
    long  1 / 1
    half  1 / 1 (cl_khr_fp16)
    float  1 / 1
    double  1 / 1 (cl_khr_fp64)
  Half-precision Floating-point support  (cl_khr_fp16)
    Denormals  No
    Infinity and NANs  No
    Round to nearest  No
    Round to zero  No
    Round to infinity  No
    IEEE754-2008 fused multiply-add  No
    Support is emulated in software  No
  Single-precision Floating-point support  (core)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support  (cl_khr_fp64)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
  Address bits  64, Little-Endian
  Global memory size  8573157376 (7.984GiB)
  Global free memory (AMD)  8372224 (7.984GiB) 8372224 (7.984GiB)
  Global memory channels (AMD)  64
  Global memory banks per channel (AMD)  4
  Global memory bank width (AMD)  256 bytes
  Error Correction support  No
  Max memory allocation  7287183768 (6.787GiB)
  Unified memory for Host and Device  No
  Shared Virtual Memory (SVM) capabilities  (core)
    Coarse-grained buffer sharing  Yes
    Fine-grained buffer sharing  Yes
    Fine-grained system sharing  No
    Atomics  No
  Minimum alignment for any data type  128 bytes
  Alignment of base address  1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM  0 bytes
    Global  0 bytes
    Local  0 bytes
  Max size for global variable  7287183768 (6.787GiB)
  Preferred total size of global vars  8573157376 (7.984GiB)
  Global Memory cache type  Read/Write
  Global Memory cache size  16384 (16KiB)
  Global Memory cache line size  64 bytes
  Image support  Yes
    Max number of samplers per kernel  26751
    Max size for 1D images from buffer  4294967295 pixels
    Max 1D or 2D image array size  8192 images
    Base address alignment for 2D image buffers  256 bytes
    Pitch alignment for 2D image buffers  256 pixels
    Max 2D image size  16384x16384 pixels
    Max 3D image size  16384x16384x8192 pixels
    Max number of read image args  128
    Max number of write image args  8
    Max number of read/write image args  64
  Max number of pipe args  16
  Max active pipe reservations  16
  Max pipe packet size  2992216472 (2.787GiB)
  Local memory type  Local
  Local memory size  65536 (64KiB)
  Local memory size per CU (AMD)  65536 (64KiB)
  Local memory banks (AMD)  32
  Max number of constant args  8
  Max constant buffer size  7287183768 (6.787GiB)
  Preferred constant buffer size (AMD)  16384 (16KiB)
  Max size of kernel argument  1024
  Queue properties (on host)
    Out-of-order execution  No
    Profiling  Yes
  Queue properties (on device)
    Out-of-order execution  Yes
    Profiling  Yes
    Preferred size  262144 (256KiB)
    Max size  8388608 (8MiB)
  Max queues on device  1
  Max events on device  1024
  Prefer user sync for interop  Yes
  Number of P2P devices (AMD)  0
  Profiling timer resolution  1ns
  Profiling timer offset since Epoch (AMD)  0ns (Thu Jan 1 00:00:00 1970)
  Execution capabilities
    Run OpenCL kernels  Yes
    Run native kernels  No
    Thread trace supported (AMD)  No
    Number of async queues (AMD)  8
    Max real-time compute queues (AMD)  8
    Max real-time compute units (AMD)  64
  printf() buffer size  4194304 (4MiB)
  Built-in kernels  (n/a)
  Device Extensions  cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)  No platform
  clCreateContext(NULL, ...) [default]  No platform
  clCreateContext(NULL, ...) [other]  Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name  AMD Accelerated Parallel Processing
    Device Name  gfx900
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name  AMD Accelerated Parallel Processing
    Device Name  gfx900
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name  AMD Accelerated Parallel Processing
    Device Name  gfx900
```

Any advice on how to narrow down this problem would be appreciated.
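One data point that may help narrow this down is the ISA target the GPU reports, since the error message suggests the runtime is looking for a code object matching the current device. Below is a minimal sketch, assuming the ISA name is printed in the `amdgcn-amd-amdhsa--gfxNNN` form seen in the rocminfo output above. `gpu_targets` is a hypothetical helper name, not part of any ROCm tool.

```python
import re

# Assumption: rocminfo prints ISA names like "amdgcn-amd-amdhsa--gfx900".
ISA_RE = re.compile(r"amdgcn-amd-amdhsa--(gfx[0-9a-z]+)")


def gpu_targets(rocminfo_text):
    """Return the distinct gfx targets named in rocminfo output, in first-seen order."""
    seen = []
    for target in ISA_RE.findall(rocminfo_text):
        if target not in seen:
            seen.append(target)
    return seen
```

To use it on a live system, pass it the text captured from `rocminfo`, for example via `subprocess.run(["rocminfo"], capture_output=True, text=True).stdout`.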

Also, I don't know whether this is related, but I'm slightly confused about kernel modules. Information such as here seems to say that I should expect amdkfd to be loaded. For me it is not, which may or may not be related to the problem above. Despite this, /dev/kfd exists, rocminfo says "ROCk module is loaded", and this repository's README.md doesn't mention kernel modules. Regardless, if I try to install rock-dkms-bin, the DKMS build fails because it cannot find kcl/backport/kcl_reservation_backport.h et al.
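To make the kernel-module observation reproducible, the three facts involved (amdgpu loaded, amdkfd loaded, /dev/kfd present) can be collected in one place. This is a minimal sketch with hypothetical helper names. As far as I know, recent mainline kernels build the KFD code into the amdgpu module rather than shipping a separate amdkfd module, which would explain /dev/kfd existing without amdkfd being listed, but treat that as an assumption to verify for your kernel.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text.

    Each non-empty line starts with the module name followed by whitespace.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def describe_kfd(modules, kfd_node_exists):
    """Summarise the module and device-node state relevant to ROCm compute."""
    return {
        "amdgpu_loaded": "amdgpu" in modules,
        "amdkfd_loaded": "amdkfd" in modules,
        "dev_kfd_present": kfd_node_exists,
    }
```

On a real system this could be called as `describe_kfd(loaded_modules(open("/proc/modules").read()), os.path.exists("/dev/kfd"))` after `import os`.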

CyberShadow commented 3 years ago

With https://github.com/ROCm-Developer-Tools/HIP/issues/2166#issuecomment-735004539 as a starting point, I traced what was causing the error:

The problem occurs because the GPU binary blob passed to libamdhip64.so doesn't have the target we want.
- The blob contains:
  - host-x86_64-unknown-linux
  - triple=hip-amdgcn-amd-amdhsa-gfx803ELF@
- But we want something with gfx900.
This binary is sent to libamdhip64.so (via __hipRegisterFatBinary) from librccl.so (via __hip_module_ctor) in its module constructor.
This module constructor is apparently generated by the compiler: https://github.com/llvm-mirror/clang/blob/master/lib/CodeGen/CGCUDANV.cpp#L477
Checking the strings of the tensorflow package does seem to confirm that its GPU blobs are built only for the gfx803 target:

$ pacman -Qql python-tensorflow-opt-rocm | grep -v /$ | xargs strings | grep hip-amdgcn-amd-amdhsa 
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/Eigen' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/absl' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/external' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/google' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/include' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/tensorflow' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/third_party' is a directory
strings: Warning: '/usr/lib/python3.9/site-packages/tensorflow/include/unsupported' is a directory
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
hip-amdgcn-amd-amdhsa-gfx803
... (etc.) ...
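The same check can be done without `strings`, and without the directory warnings. This is a minimal Python sketch, assuming the target is embedded as the plain byte string `hip-amdgcn-amd-amdhsa-gfxNNN` as in the output above. `targets_in_file` and `targets_in_tree` are hypothetical helper names.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: offload targets appear as raw bytes such as
# b"hip-amdgcn-amd-amdhsa-gfx803"; one or more dashes before "gfx" are accepted.
TARGET_RE = re.compile(rb"hip-amdgcn-amd-amdhsa-+(gfx[0-9a-z]+)")


def targets_in_file(path):
    """Count each gfx target string embedded in a single file."""
    data = Path(path).read_bytes()  # reads the whole file; fine for a sketch
    return Counter(m.decode("ascii") for m in TARGET_RE.findall(data))


def targets_in_tree(root):
    """Sum target counts over every regular file below root."""
    total = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():  # skip directories, unlike the xargs pipeline
            total.update(targets_in_file(path))
    return total
```

Pointing `targets_in_tree` at the installed tensorflow directory should then report only gfx803 for the package above. If the target that rocminfo reports for the GPU (gfx900 here) is missing from the result, that would be consistent with the hipErrorNoBinaryForGpu abort.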

CyberShadow commented 3 years ago

Rebuilding tensorflow-rocm outside of a chroot now produces binary files that contain hip-amdgcn-amd-amdhsa-gfx900 instead of gfx803. As expected, the newly built package works.

So, like others who encountered this problem before me, I stumbled my way to a solution, but the root cause remains undetermined. I suspect it may have something to do with the build order or chroot isolation.

acxz commented 10 months ago

@CyberShadow tensorflow-rocm has finally been fixed after being broken for a while. If you'd like, could you report back after trying the latest version? For now I'll close this issue, since it has been a long time, but feel free to comment and I'll reopen it if the issue hasn't been resolved.

rocm-arch / tensorflow-rocm

guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!") #21