Open Said-Akbar opened 6 days ago
@lamikr, that error line comes from https://github.com/ROCm/clr/blob/rocm-6.1.x/hipamd/src/hip_global.cpp#L114 .
But I am not sure how to fix my issue above. Please let me know if you have time to review this today. Thanks!
Hi, unfortunately I do not have a gfx906 myself for debugging, so I only added the patches needed to at least get it to build and start testing, and added its support as experimental.
About your error, I have never seen that kind of error, but it could be some kind of misconfiguration in rocBLAS related to src_projects/rocBLAS/library/src/blas3/Tensile/Logic/asm_full/vega10/vega10_Cijk_Alik_Bljk_HB_GB.yaml
But let's first check a couple of basic things step by step so I get the basic info.
1) Can you first paste the output of the rocminfo command? I am interested in whether it detects your GPU and what information it shows about it.
2) Then, are you able to build and run these test apps:
/opt/rocm_sdk_612/docs/examples/hipcc/hello_world
/opt/rocm_sdk_612/docs/examples/opencl/check_opencl_caps
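For the first check, a small script can pull the gfx target names out of the rocminfo output so it is easy to see whether the GPU is detected as expected. This is only a sketch: the function names are made up for illustration, and it simply searches the text for anything that looks like a gfx target rather than relying on the exact rocminfo layout.

```python
import re
import shutil
import subprocess

# Matches target names such as gfx906, gfx90a or gfx1030.
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]{3,4}\b")


def find_gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the distinct gfx target names found in the text, in first-seen order."""
    seen: list[str] = []
    for match in GFX_PATTERN.finditer(rocminfo_text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen


def detected_targets() -> list[str]:
    """Run rocminfo if it is on PATH and return the gfx targets it mentions."""
    exe = shutil.which("rocminfo")
    if exe is None:
        # rocminfo is not installed or not on PATH.
        return []
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    return find_gfx_targets(result.stdout)


if __name__ == "__main__":
    targets = detected_targets()
    print("gfx targets reported by rocminfo:", targets or "none")
```

If the list is empty, or does not contain the target selected during the build, that is worth sorting out before looking at rocBLAS or llama.cpp.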
Hello @lamikr, Sure, here is the output of rocminfo.
tests:
cd /opt/rocm_sdk_612/docs/examples/hipcc/hello_world/
./build.sh
rm -f ./hello_world
rm -f hello_world.o
rm -f /opt/rocm_sdk_612/src/*.o
/opt/rocm_sdk_612/bin/hipcc -g -fPIE -c -o hello_world.o hello_world.cpp
/opt/rocm_sdk_612/bin/hipcc hello_world.o -fPIE -o hello_world
./hello_world
System minor: 0
System major: 9
Agent name: AMD Radeon Graphics
Kernel input: GdkknVnqkc
Expecting that kernel increases each character from input string by one
Kernel output string: HelloWorld
Output string matched with HelloWorld
Test ok!
OpenCL test:
cd /opt/rocm_sdk_612/docs/examples/opencl/check_opencl_caps
make
./check_opencl_caps
number of opencl platform devices: 1
==============================
Platform id: 0
AMD Accelerated Parallel Processing
Advanced Micro Devices, Inc.
OpenCL 2.1 AMD-APP (3614.0)
FULL_PROFILE
cl_khr_icd cl_amd_event_callback
Number of devices found for platform: 2
---------------------------
Device id: 0
CL_DEVICE_VENDOR_ID: 0x1002
CL_DEVICE_TYPE: GPU
CL_DEVICE_VENDOR_ID: 0x1002
CL_DEVICE_MAX_COMPUTE_UNITS: 0x40
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 0x3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 0x3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 0x4
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 0x2
todo more information...
---------------------------
---------------------------
Device id: 1
CL_DEVICE_VENDOR_ID: 0x1002
CL_DEVICE_TYPE: GPU
CL_DEVICE_VENDOR_ID: 0x1002
CL_DEVICE_MAX_COMPUTE_UNITS: 0x40
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 0x3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 0x3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 0x4
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 0x2
todo more information...
---------------------------
==============================
By the way, gfx906 corresponds to 'Vega 20' GPUs, not 'Vega 10' GPUs. I am not sure whether llama.cpp is calling some instruction that does not exist on gfx906.
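To illustrate the Vega 10 / Vega 20 distinction: the hello_world output above reports only major 9 and minor 0, which is the same for gfx900 (Vega 10) and gfx906 (Vega 20). As far as I understand the naming, the target name is the major version followed by the minor and stepping as one hex digit each, so only the stepping separates the two. A minimal sketch of that convention follows; the function name is hypothetical, and the stepping values are not shown in the output above.

```python
def gfx_target_name(major: int, minor: int, stepping: int) -> str:
    """Build a gfx target name from a version triple.

    Assumed convention: decimal major, then minor and stepping
    as a single lowercase hex digit each (e.g. 9.0.10 -> gfx90a).
    """
    if not (0 <= minor < 16 and 0 <= stepping < 16):
        raise ValueError("minor and stepping must each fit in one hex digit")
    return f"gfx{major}{minor:x}{stepping:x}"


if __name__ == "__main__":
    # Same major/minor, different stepping.
    print(gfx_target_name(9, 0, 0))  # Vega 10
    print(gfx_target_name(9, 0, 6))  # Vega 20
```

So a kernel or tuning file selected for the vega10 (gfx900) target is not automatically the right one for gfx906, even though both report 9.0.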
Here is the app crash log :
Hi @lamikr,
I built rocm_sdk_builder on a freshly installed Ubuntu 24.04.1. It took 5 hours, 120GB of storage and many hours of fixing small issues while building the repo (reference: https://github.com/lamikr/rocm_sdk_builder/issues/175). Also, I chose gfx906 from ./babs.sh -c. When I ran ./run_and_save_benchmarks.sh, I got this message. Note the error at the bottom, 'Cannot find Symbol with name'. I thought this would not be an issue with llama.cpp. However, I got a similar error in llama.cpp as well (I built it using ./babs.sh -b binfo/extra/ai_tools.blist). Note that llama.cpp works on the CPU when I do not set the ngl parameter (layer offloading). Please let me know if there is a fix.
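One thing that may help narrow this down is checking whether the installed libraries actually contain files built for gfx906. The sketch below lists files whose names contain the target string. Both the directory /opt/rocm_sdk_612/lib/rocblas/library and the idea that per-target files carry the gfx name in their filename are assumptions on my part, not something confirmed in this thread, and the function name is hypothetical.

```python
from pathlib import Path


def files_for_target(library_dir: str, target: str) -> list[str]:
    """Return sorted names of files under library_dir whose name contains target."""
    root = Path(library_dir)
    if not root.is_dir():
        # Directory missing: nothing was installed there.
        return []
    return sorted(
        path.name
        for path in root.rglob("*")
        if path.is_file() and target in path.name
    )


if __name__ == "__main__":
    # ASSUMPTION: install location of the rocBLAS kernel library files.
    library_dir = "/opt/rocm_sdk_612/lib/rocblas/library"
    matches = files_for_target(library_dir, "gfx906")
    print(f"{len(matches)} file(s) mentioning gfx906 in {library_dir}")
    for name in matches:
        print(" ", name)
```

If nothing is listed for gfx906 while other targets are present, that would suggest the target was not picked up during the build, which could be consistent with a missing-symbol error at runtime.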