lamikr / rocm_sdk_builder


Radeon VII #175

Open commandline-be opened 3 weeks ago

commandline-be commented 3 weeks ago

Owner of a Radeon VII card here. If I can help test code to run well on it, let me know.

lamikr commented 2 weeks ago

Hi, that would be really interesting!

Most of the ROCm components seem to have Radeon VII/gfx906 support in their code by default, and a couple of weeks ago I went through all the common places that typically require patching. I have tested that everything should now build for these cards, but I have not been able to test functionality on a Radeon VII.

That said, if you have time it would be great if you could try to make the build and test it. These steps should help you to get started.

git clone git@github.com:lamikr/rocm_sdk_builder.git 
cd rocm_sdk_builder
./install_deps.sh
./babs.sh -c    # choose gfx906
./babs.sh -b

Once the build has progressed past rocminfo and amd-smi, those commands are a good way to start checking the build.


source /opt/rocm_sdk_612/bin/env_rocm.sh
rocminfo
amd-smi metric

HIP and OpenCL compiler tests should also be doable pretty soon (no need to wait for the whole build to finish).

source /opt/rocm_sdk_612/bin/env_rocm.sh
cd /opt/rocm_sdk_612/docs/examples/hipcc/hello_world
./build.sh
cd /opt/rocm_sdk_612/docs/examples/opencl/hello_world
./build.sh

Once the build has finished, if things work well, PyTorch should also support your GPU. Some basic benchmarks can be run with:

cd benchmarks
./run_and_save_benchmarks.sh

If these work, you can also build llama_cpp, stable-diffusion-webui and vllm with this command:

./babs.sh -b binfo/extra/ai_tools.blist

All of those also have their own example apps that you can run either on the console or by starting their web server and connecting to it with a browser. (I can help more later if needed.)

commandline-be commented 1 week ago

Thanks :-) I'll try that asap

Said-Akbar commented 6 days ago

Hello @lamikr ,

Thank you for your amazing work! I am really glad I found this repo.

I have two AMD MI60 cards (gfx906). I will also compile this repo and share test results with you!

I am specifically interested in vLLM batch/concurrent inference speeds. So far, I have not been able to compile vLLM with default installations of ROCm 6.2.2 and vLLM. Another issue I faced was the lack of flash attention support. I see this repo has aotriton with support for gfx906. I hope the aotriton implementation of flash attention works with this repo. Reference: https://github.com/ROCm/aotriton/pull/39

There is also a composable_kernel-based flash attention implementation here: https://github.com/ROCm/flash-attention (v2.6.3). This FA compiles fine with default ROCm 6.2.2 on Ubuntu 22.04, but the exllamav2 backend with llama3 8B started generating gibberish text (exllamav2 works fine without FA2, but it is very slow without FA2). I hope this repo fixes this gibberish text generation problem with FA2.

Thanks again!

Said-Akbar commented 6 days ago

Quick update. I did a fresh installation of Ubuntu 24.04.1 today, which takes around 6.5GB of SSD storage. It installs Nvidia GPU drivers by default. I assumed this repo would install AMD GPU drivers, but it did not. This should probably be mentioned in the README, with a brief description of how to install GPU drivers. So, I installed the AMD GPU drivers as follows:

sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
wget https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb
sudo apt install ./amdgpu-install_6.2.60204-1_all.deb
sudo apt update

Also, there were several packages missing in Ubuntu which I had to install after I saw error messages in ./install_deps.sh.

sudo apt install rpm
sudo apt install python3-pip
sudo apt install git-lfs

Only after that was I able to run ./install_deps.sh without errors. I selected gfx906 for ./babs.sh -c and now I'm waiting for ./babs.sh -b to finish. So far it has been running for 1.5 hours on my AMD 5950X CPU with 96GB DDR4 3200MHz. Currently, the script is installing flang_libpgmath.
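
The missing packages above could be caught before running ./install_deps.sh. A minimal sketch, assuming the three packages provide commands named rpm, pip3 and git-lfs; the function name missing_tools is hypothetical and not part of rocm_sdk_builder:

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper, not part of rocm_sdk_builder.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Command names are assumptions based on the packages installed above.
missing=$(missing_tools rpm pip3 git-lfs)
if [ -n "$missing" ]; then
    echo "Missing tools, install them before ./install_deps.sh:"
    echo "$missing"
fi
```

The check only looks at PATH, so it says nothing about whether the installed versions are the ones the build expects.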

One more piece of feedback: could you please add a global progress indicator to the terminal logs that shows how many packages have been built and how many remain?
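
As an illustration of what such a progress line could look like, here is a small sketch. It is not how babs.sh works today; progress_line is a hypothetical helper, and the counts passed to it are made-up example values:

```shell
#!/bin/sh
# progress_line is a hypothetical helper; babs.sh does not provide it.
# Usage: progress_line <built> <total>
progress_line() {
    built=$1
    total=$2
    if [ "$total" -le 0 ]; then
        echo "progress: no packages"
        return 1
    fi
    # Integer percentage, rounded down.
    pct=$(( built * 100 / total ))
    printf '[%d/%d] %d%% of packages built\n' "$built" "$total" "$pct"
}

# Example values only; the real package count depends on the selected build list.
progress_line 30 120
```

With the example values this prints `[30/120] 25% of packages built`.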

Said-Akbar commented 6 days ago

ok, I want to report an error that occurred while building the source code. I ran ./babs.sh -b and after 1.5 hours, this is the error message I see:

-- LIBOMPTARGET: Not building hostexec for NVPTX because cuda not found
   -- Building hostexec with LLVM 17.0.0git found with CLANG_TOOL /opt/rocm_sdk_612/bin/clang
-- LIBOMPTARGET: Building the llvm-omp-device-info tool
-- LIBOMPTARGET: Building the llvm-omp-kernel-replay tool
-- LIBOMPTARGET: Building DeviceRTL. Using clang: /opt/rocm_sdk_612/bin/clang, llvm-link: /opt/rocm_sdk_612/bin/llvm-link and opt: /opt/rocm_sdk_612/bin/opt
-- LIBOMPTARGET: DeviceRTLs gfx906: Getting ROCm device libs from /opt/rocm_sdk_612/lib64/cmake/AMDDeviceLibs
 ===================> bc_files: Configuration.cpp-400-gfx906.bc;Debug.cpp-400-gfx906.bc;Kernel.cpp-400-gfx906.bc;LibC.cpp-400-gfx906.bc;Mapping.cpp-400-gfx906.bc;Misc.cpp-400-gfx906.bc;Parallelism.cpp-400-gfx906.bc;Reduction.cpp-400-gfx906.bc;State.cpp-400-gfx906.bc;Synchronization.cpp-400-gfx906.bc;Tasking.cpp-400-gfx906.bc;Utils.cpp-400-gfx906.bc;Workshare.cpp-400-gfx906.bc;ExtraMapping.cpp-400-gfx906.bc;Xteamr.cpp-400-gfx906.bc;Memory.cpp-400-gfx906.bc;Xteams.cpp-400-gfx906.bc;/home/saidp/Downloads/rocm_sdk_builder/builddir/016_03_llvm_project_openmp/libm-400-gfx906.bc;/home/saidp/Downloads/rocm_sdk_builder/builddir/016_03_llvm_project_openmp/libomptarget/hostexec/libhostexec-400-gfx906.bc;/opt/rocm_sdk_612/amdgcn/bitcode/ocml.bc;/opt/rocm_sdk_612/amdgcn/bitcode/ockl.bc;/opt/rocm_sdk_612/amdgcn/bitcode/oclc_wavefrontsize64_on.bc;/opt/rocm_sdk_612/amdgcn/bitcode/oclc_isa_version_906.bc;/opt/rocm_sdk_612/amdgcn/bitcode/oclc_abi_version_400.bc ========================
-- LIBOMPTARGET: DeviceRTLs gfx906: Getting ROCm device libs from /opt/rocm_sdk_612/lib64/cmake/AMDDeviceLibs
 ===================> bc_files: Configuration.cpp-500-gfx906.bc;Debug.cpp-500-gfx906.bc;Kernel.cpp-500-gfx906.bc;LibC.cpp-500-gfx906.bc;Mapping.cpp-500-gfx906.bc;Misc.cpp-500-gfx906.bc;Parallelism.cpp-500-gfx906.bc;Reduction.cpp-500-gfx906.bc;State.cpp-500-gfx906.bc;Synchronization.cpp-500-gfx906.bc;Tasking.cpp-500-gfx906.bc;Utils.cpp-500-gfx906.bc;Workshare.cpp-500-gfx906.bc;ExtraMapping.cpp-500-gfx906.bc;Xteamr.cpp-500-gfx906.bc;Memory.cpp-500-gfx906.bc;Xteams.cpp-500-gfx906.bc;/home/saidp/Downloads/rocm_sdk_builder/builddir/016_03_llvm_project_openmp/libm-500-gfx906.bc;/home/saidp/Downloads/rocm_sdk_builder/builddir/016_03_llvm_project_openmp/libomptarget/hostexec/libhostexec-500-gfx906.bc;/opt/rocm_sdk_612/amdgcn/bitcode/ocml.bc;/opt/rocm_sdk_612/amdgcn/bitcode/ockl.bc;/opt/rocm_sdk_612/amdgcn/bitcode/oclc_wavefrontsize64_on.bc;/opt/rocm_sdk_612/amdgcn/bitcode/oclc_isa_version_906.bc;/opt/rocm_sdk_612/amdgcn/bitcode/oclc_abi_version_500.bc ========================

CMake Error at cmake/OpenMPTesting.cmake:209 (add_custom_target):
  add_custom_target cannot create target
  "check-libomptarget-nvptx64-nvidia-cuda" because another target with the
  same name already exists.  The existing target is a custom target created
  in source directory
  "/home/saidp/Downloads/rocm_sdk_builder/src_projects/llvm-project/openmp/libomptarget/test".
  See documentation for policy CMP0002 for more details.
Call Stack (most recent call first):
  libomptarget/test/CMakeLists.txt:23 (add_openmp_testsuite)

CMake Error at cmake/OpenMPTesting.cmake:209 (add_custom_target):
  add_custom_target cannot create target
  "check-libomptarget-nvptx64-nvidia-cuda-LTO" because another target with
  the same name already exists.  The existing target is a custom target
  created in source directory
  "/home/saidp/Downloads/rocm_sdk_builder/src_projects/llvm-project/openmp/libomptarget/test".
  See documentation for policy CMP0002 for more details.
Call Stack (most recent call first):
  libomptarget/test/CMakeLists.txt:23 (add_openmp_testsuite)

For reference, I am attaching the full error output from running the build a second time with ./babs.sh -b >>error_output.txt 2>&1: error_output.txt
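
A saved log like this one is long, so it can help to pull out only the CMake error headers. A minimal sketch, assuming the log was captured to error_output.txt as above; cmake_errors is a hypothetical helper name:

```shell
#!/bin/sh
# cmake_errors is a hypothetical helper: print each "CMake Error" line
# from a saved build log, prefixed with its line number in the log.
cmake_errors() {
    grep -n '^CMake Error' "$1"
}

# Typical use (file name taken from the comment above):
#   ./babs.sh -b 2>&1 | tee error_output.txt
#   cmake_errors error_output.txt
```

Note that tee overwrites the file on each run, while the >> redirection used above appends to it.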

Short info about my PC:
OS: Ubuntu 24.04.1
CPU: AMD 5950x
RAM: 96GB DDR4 3200Mhz
Storage: SSD 1TB + HDD
GPUs: RTX 3090 (for Video output), 2xAMD MI60 (gfx906)


I ran the following commands and they worked.

source /opt/rocm_sdk_612/bin/env_rocm.sh
rocminfo
amd-smi metric
cd /opt/rocm_sdk_612/docs/examples/hipcc/hello_world
./build.sh
cd /opt/rocm_sdk_612/docs/examples/opencl/hello_world
./build.sh

rocminfo correctly showed those two MI60 cards. hipcc and opencl examples worked without errors. Only ./run_and_save_benchmarks.sh did not work due to missing torch library.
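
To check the number of detected cards without reading the whole rocminfo listing, the agent names can be counted. A minimal sketch, assuming rocminfo prints each agent's target on an indented "Name:" line; count_agents is a hypothetical helper name:

```shell
#!/bin/sh
# count_agents is a hypothetical helper: read rocminfo output on stdin and
# count the lines whose Name field is exactly the given gfx target.
# Assumes the agent target is printed as "  Name:   gfx906" on its own line.
count_agents() {
    grep -c -E "^[[:space:]]*Name:[[:space:]]+${1}[[:space:]]*\$"
}

# Typical use on a machine with the SDK environment loaded:
#   rocminfo | count_agents gfx906
```

On the two-MI60 system described above, the expected count would be 2. ISA lines such as amdgcn-amd-amdhsa--gfx906 are not counted because the match requires the whole field to equal the target.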


Please let me know whether I need to install CUDA libraries, or otherwise how I can fix the error above.

Thanks!

Said-Akbar commented 6 days ago

@lamikr, I think the error I am seeing might be related to https://github.com/spack/spack/issues/45411, but I am not sure how to implement the fix here. Let me know. Thanks!

Said-Akbar commented 5 days ago

Quick update. The installation is working after I removed all Nvidia drivers and restarted my PC.

sudo apt-get purge 'nvidia*'
sudo apt-get autoremove
sudo apt-get autoclean

Now, Ubuntu is using X.Org Server Nouveau drivers.
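
After the reboot, the loaded GPU kernel modules can be listed to confirm which driver is active. A minimal sketch that filters lsmod output; gpu_modules is a hypothetical helper name, and the module names it looks for (nvidia*, nouveau, amdgpu) are the usual ones but are stated here as an assumption:

```shell
#!/bin/sh
# gpu_modules is a hypothetical helper: read lsmod output on stdin and print
# the names of loaded GPU driver modules (nvidia*, nouveau, amdgpu).
gpu_modules() {
    # NR > 1 skips the "Module Size Used by" header line.
    awk 'NR > 1 && ($1 ~ /^nvidia/ || $1 == "nouveau" || $1 == "amdgpu") { print $1 }'
}

# Typical use:
#   lsmod | gpu_modules
```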

Said-Akbar commented 4 days ago

Finally, the ROCm SDK was installed on my PC after 5 hours. It takes ~90GB of space in rocm_sdk_builder, 8.5GB in the triton folder, ~2GB in the lib/x86_64-linux-gnu folder (mostly LLVM) and ~20GB in the /opt/rocm_sdk_612 folder. A total of about 120GB of files! Is there a way to create an installable version of my current setup (all 120GB)? Building it is huge and time-consuming. For comparison, a ROCm installation from binaries takes around 30GB.
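
Figures like these can be collected with du. A minimal sketch; report_sizes is a hypothetical helper name, and the checkout path under $HOME/Downloads is an assumption taken from the build log paths earlier in this thread:

```shell
#!/bin/sh
# report_sizes is a hypothetical helper: print a human-readable size for
# each existing directory and note the ones that do not exist.
report_sizes() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            du -sh "$dir"
        else
            printf 'missing: %s\n' "$dir"
        fi
    done
}

# Paths are assumptions; adjust the checkout location to your own.
report_sizes "$HOME/Downloads/rocm_sdk_builder" /opt/rocm_sdk_612
```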

Said-Akbar commented 4 days ago

here are the benchmark results. I think the flash attention test failed.

./run_and_save_benchmarks.sh
Timestamp for benchmark results: 20241121_190404
Saving to file: 20241121_190404_cpu_vs_gpu_simple.txt
Benchmarking CPU and GPUs
Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-61a06a2f8
       Device:  AMD Ryzen 9 5950X 16-Core Processor
    'CPU time: 26.503 sec
       Device: AMD Radeon Graphics
    'GPU time: 0.399 sec
       Device: AMD Radeon Graphics
    'GPU time: 0.353 sec
Benchmark ready

Saving to file: 20241121_190404_pytorch_dot_products.txt
Pytorch version: 2.4.1
dot product calculation test
tensor([[[ 0.2042, -0.5683,  0.5711,  1.5666, -0.8859, -0.4255, -0.6103,
          -0.5932],
         [-0.1816, -1.0552,  0.3676,  2.1399, -0.8622,  0.1185, -0.4614,
          -0.4577],
         [ 0.2491, -0.5238,  0.5873,  1.5027, -0.8808, -0.4906, -0.6309,
          -0.6083]],

        [[-0.0812,  0.5027, -0.0134, -0.1771, -1.6389,  0.0154, -1.1964,
          -0.3948],
         [-0.3459, -0.4265,  0.0969,  0.0608, -0.9923, -0.4199, -0.7190,
          -0.0208],
         [-0.2615, -0.6958,  0.1066, -0.1948, -1.2152, -0.1223, -0.6278,
           0.1627]]], device='cuda:0')

Benchmarking cuda and cpu with Default, Math, Flash Attention amd Memory pytorch backends
Device: AMD Radeon Graphics / cuda:0
    Default benchmark:
:0:/home/saidp/Downloads/rocm_sdk_builder/src_projects/clr/hipamd/src/hip_global.cpp:114 : 8471950880 us: [pid:454884 tid:0x7ad2a9db0b80] Cannot find Symbol with name: Cijk_Alik_Bljk_HHS_BH_MT128x64x16_SE_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_BS1_DTLA0_DTLB0_EPS1_FL1_GLVWA4_GLVWB4_GRVW4_GSU1_GSUASB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MDA1_MMFGLC_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_SIA1_SU32_SUM0_SUS256_SVW4_SNLL0_TT8_4_USFGROn1_VAW2_VSn1_VW4_VWB4_WG16_16_1_WGM1

Said-Akbar commented 4 days ago

The error above prevents llama.cpp from running any models on the GPU. Let me file a bug.

commandline-be commented 2 days ago

@lamikr I finally got around to doing the testing.

Initially the build went smooth-ish, then I noticed something had failed.

After doing a ./babs.sh --clean and starting ./babs.sh -b again, I now get this error:

HIP_COMPILER=clang HIP_RUNTIME=rocclr ROCM_PATH=/opt/rocm_sdk_612 HIP_ROCCLR_HOME=/opt/rocm_sdk_612 HIP_CLANG_PATH=/opt/rocm_sdk_612/bin HIP_INCLUDE_PATH=/opt/rocm_sdk_612/include HIP_LIB_PATH=/opt/rocm_sdk_612/lib DEVICE_LIB_PATH=/opt/rocm_sdk_612/amdgcn/bitcode HIP_CLANG_RT_LIB=/opt/rocm_sdk_612/lib/clang/17/lib/linux
hipcc-args: -DENABLE_BACKTRACE -DHAVE_BACKTRACE_H -I/usr/src/rocm_sdk_builder/src_projects/roctracer/src/util -O3 -DNDEBUG -fPIC -Wall -Werror -std=gnu++17 -MD -MT src/CMakeFiles/util.dir/util/debug.cpp.o -MF CMakeFiles/util.dir/util/debug.cpp.o.d -o CMakeFiles/util.dir/util/debug.cpp.o -c /usr/src/rocm_sdk_builder/src_projects/roctracer/src/util/debug.cpp
hipcc-cmd: "/opt/rocm_sdk_612/bin/clang" -isystem "/opt/rocm_sdk_612/include" --offload-arch=gfx906 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --hip-path="/opt/rocm_sdk_612" --hip-device-lib-path="/opt/rocm_sdk_612/amdgcn/bitcode" -DENABLE_BACKTRACE -DHAVE_BACKTRACE_H -I/usr/src/rocm_sdk_builder/src_projects/roctracer/src/util -O3 -DNDEBUG -fPIC -Wall -Werror -std=gnu++17 -MD -MT src/CMakeFiles/util.dir/util/debug.cpp.o -MF CMakeFiles/util.dir/util/debug.cpp.o.d -o "CMakeFiles/util.dir/util/debug.cpp.o" -c -x hip /usr/src/rocm_sdk_builder/src_projects/roctracer/src/util/debug.cpp
/usr/src/rocm_sdk_builder/src_projects/roctracer/src/util/debug.cpp:33:10: fatal error: 'backtrace.h' file not found
   33 | #include <backtrace.h>
      |          ^~~~~
1 error generated when compiling for gfx906.
make[2]: *** [src/CMakeFiles/util.dir/build.make:76: src/CMakeFiles/util.dir/util/debug.cpp.o] Error 1
make[2]: Leaving directory '/usr/src/rocm_sdk_builder/builddir/011_01_roctracer'
make[1]: *** [CMakeFiles/Makefile2:220: src/CMakeFiles/util.dir/all] Error 2
make[1]: Leaving directory '/usr/src/rocm_sdk_builder/builddir/011_01_roctracer'
make: *** [Makefile:156: all] Error 2
build failed: roctracer

lamikr commented 1 day ago

Hi, thanks for the reports. Flash attention support for gfx906 would need to be implemented in aotriton. As it's a gfx9-family GPU, I need to check whether the triton code there that supports newer gfx9* cards could also be made to work with gfx906.

Although I do not have a gfx906, I will start a new build for it on Ubuntu 24.04 and try to reproduce the build errors. If you have some fixes, are you able to make a pull request?

commandline-be commented 1 day ago

hey @lamikr

The build is on Linux Mint Debian Edition. If need be, I can make pull requests. Can you help identify where backtrace.h comes from?

lamikr commented 20 hours ago

I have multiple versions of it under src_projects directory

$ cd src_projects/
$ find -name backtrace.h

./rocgdb/libbacktrace/backtrace.h
./rocMLIR/external/llvm-project/compiler-rt/lib/gwp_asan/optional/backtrace.h
./binutils-gdb/libbacktrace/backtrace.h
./openmpi/opal/mca/backtrace/backtrace.h
./llvm-project/compiler-rt/lib/gwp_asan/optional/backtrace.h
./pytorch/third_party/tensorpipe/third_party/libnop/include/nop/utility/backtrace.h

I am not sure what is causing it. Maybe the install directory /opt/rocm_sdk_612 should also be removed before starting a clean build. Let's try to reset everything and then start a fresh build. (Normally this should not be needed, and the only commands required would be ./babs.sh -up and ./babs.sh -b to rebuild only the changed projects.)

./babs.sh -ca
./babs.sh -up
./babs.sh --clean
rm -rf /opt/rocm_sdk_612
./babs.sh -b
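
The compile command in the failing log only adds /opt/rocm_sdk_612/include and roctracer/src/util to the include search path, so the copies under src_projects would not be picked up by that compile. Before rebuilding, it may be worth checking whether the header exists in the install directory or in the system include directories. A minimal sketch along the lines of the find command above; find_header is a hypothetical helper name, and the system directory list is an assumption that may differ between distributions:

```shell
#!/bin/sh
# find_header is a hypothetical helper: search the given directories for a
# header file by name, skipping directories that do not exist.
find_header() {
    name=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] && find "$dir" -name "$name" 2>/dev/null
    done
    return 0
}

# Directory list is an assumption about where the header might be installed.
find_header backtrace.h /opt/rocm_sdk_612/include /usr/include /usr/local/include /usr/lib/gcc
```

If this prints nothing, the header is not available in any of the searched locations, which would be consistent with the 'backtrace.h' file not found error.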

I have not yet solved the llama.cpp error with gfx906, but I am trying to add more debug output related to it to the next build. Let's track that issue at https://github.com/lamikr/rocm_sdk_builder/issues/180