lamikr / rocm_sdk_builder

gfx90c support - Renoir Vega 10 APU #112

Open daniandtheweb opened 4 months ago

daniandtheweb commented 4 months ago

I'm currently building the project for my laptop (Ryzen 4700U), but its integrated GPU is not officially supported. So far I've been able to build successfully up to rocBLAS, but since the device is unsupported I can't get any further.

I'll try to modify the patches to add gfx90c. I've already tested the GPU with my distribution's ROCm packages, and overriding the GFX version to 9.0.0 makes everything work fine.
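
A minimal sketch of the override mentioned above, assuming it is done through the ROCm runtime's `HSA_OVERRIDE_GFX_VERSION` environment variable (the helper name `env_with_gfx_override` is made up for illustration). It builds an environment for a child process, so the override does not leak into the rest of the session:

```python
import os


def env_with_gfx_override(version="9.0.0", base=None):
    """Return a copy of the environment with the GFX version override set.

    Assumption: "9.0.0" makes the runtime treat gfx90c as gfx900.
    `base` defaults to the current process environment and is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env
```

It could be passed to a child process, for example `subprocess.run(["rocminfo"], env=env_with_gfx_override())`. Exporting the same variable in the shell before launching the program should have the same effect.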

One thing that I think could be improved is letting ROCm dynamically allocate system RAM as VRAM, since by default only 512 MB of memory is allocated as VRAM. I've tested some projects after manually allocating more VRAM in the BIOS, and even though the GPU is quite slow compared to newer cards, it still produces results faster than the CPU alone.

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 4700U with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 4700U with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4214                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    15709124(0xefb3c4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15709124(0xefb3c4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    15709124(0xefb3c4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx90c                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5686(0x1636)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1600                               
  BDFID:                   1024                               
  Internal Node ID:        1                                  
  Compute Unit:            7                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 472                                
  SDMA engine uCode::      40                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    7854560(0x77d9e0) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    7854560(0x77d9e0) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

daniandtheweb commented 4 months ago

This is the current issue I'm having with rocBLAS:

Tensile::WARNING: Global parameter WriteMasterSolutionIndex = False unrecognized.
# CodeObjectVersion from TensileCreateLibrary: V5
# CxxCompiler       from TensileCreateLibrary: hipcc
# Architecture      from TensileCreateLibrary: gfx90c
# LibraryFormat     from TensileCreateLibrary: msgpack
Tensile::FATAL: Architecture gfx90c not supported
CMake Error at /home/daniandtheweb/Workspace/rocm_sdk_builder/builddir/023_02_rocBLAS/virtualenv/cmake/TensileConfig.cmake:277 (message):
  Error creating Tensile library: 255
Call Stack (most recent call first):
  library/src/CMakeLists.txt:74 (TensileCreateLibraryFiles)

daniandtheweb commented 4 months ago

The main issue seems to be in the Tensile project, as it doesn't have explicit support for this GPU.

lamikr commented 4 months ago

I can work on this at some point. I still have a 2400G with Vega 11 myself, and I have had it running on the ROCm stack in the past. At that time it also needed some kernel patching (5.05 kernel maybe), or it would fail when launching ML kernels. Hopefully newer kernels now work out of the box.

daniandtheweb commented 4 months ago

Some time ago I tested Arch's ROCm stack with the GFX version overridden to 9.0.0, and everything worked without any kernel patching, so hopefully adding at least basic support for this card shouldn't require that much work.

daniandtheweb commented 3 months ago

I've recently been testing the prebuilt PyTorch for ROCm 6.1 on this APU again, and it mostly works fine with the GFX version workaround. The good news is that a recent Linux update (6.10) allows programs to directly access the GTT memory and use it as VRAM. Performance is quite slow, but it's perfectly usable for a thin laptop, and it effectively gives me 8 GB of virtual VRAM: a 512x512 Stable Diffusion image takes 40 seconds.
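
To see how much VRAM and GTT memory the driver reports, here is a small sketch that reads the amdgpu sysfs counters `mem_info_vram_total` and `mem_info_gtt_total`. The device directory `/sys/class/drm/card0/device` is an assumption (the card index may differ on other machines), and the function name is only illustrative:

```python
from pathlib import Path


def read_amdgpu_memory(device_dir="/sys/class/drm/card0/device"):
    """Return the VRAM and GTT totals in MiB, keyed by "vram" and "gtt".

    The sysfs files hold sizes in bytes. A key is left out of the result
    when its file does not exist (for example on a non-amdgpu device).
    """
    sizes = {}
    for key in ("vram", "gtt"):
        path = Path(device_dir) / f"mem_info_{key}_total"
        if path.is_file():
            sizes[key] = int(path.read_text().strip()) // (1024 * 1024)
    return sizes
```

Calling `read_amdgpu_memory()` on the laptop should show the small BIOS carve-out under `"vram"` and the larger shared pool under `"gtt"`.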

I'm having some trouble running llama.cpp (rocBLAS related).

I'll try building again and see if I can work around the issues I was having.

lamikr commented 3 weeks ago

For llama.cpp, I am not sure whether it would help, but you could try overriding the version to gfx1030 instead, for example, so that the GPU is detected as an RDNA2 card. There are a couple of places where different AMD GPU versions are checked using defined(gfx900), defined(gfx1030), etc. For example:

ggml/src/ggml-cuda/vendors/hip.h
ggml/src/ggml-cuda/common.cuh
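
For experimenting with different overrides, here is a rough sketch that turns a gfx target name into the dotted version string. It rests on two assumptions: the override is set through the `HSA_OVERRIDE_GFX_VERSION` environment variable, which takes decimal `major.minor.stepping`, and the last two characters of a gfx name are hexadecimal minor and stepping digits. The function name is hypothetical.

```python
import re

# "gfx" + decimal major + one hex digit (minor) + one hex digit (stepping)
_GFX_NAME = re.compile(r"^gfx(\d+)([0-9a-f])([0-9a-f])$")


def gfx_name_to_version(name):
    """Convert a name such as "gfx1030" into "10.3.0"."""
    match = _GFX_NAME.match(name)
    if match is None:
        raise ValueError(f"not a plain gfx target name: {name!r}")
    major = int(match.group(1))
    minor = int(match.group(2), 16)
    stepping = int(match.group(3), 16)
    return f"{major}.{minor}.{stepping}"
```

Under these assumptions, `gfx_name_to_version("gfx1030")` gives `"10.3.0"` and `gfx_name_to_version("gfx900")` gives `"9.0.0"`, the value used in the earlier workaround.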