vladmandic / automatic

SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: Rocm AMD Radeon RX 6400 not used / recognized #1275

Closed: MightyPork closed this issue 1 year ago

MightyPork commented 1 year ago

Issue Description

I'm trying to get this tool working after using Easy Diffusion for a while without problems, using export HSA_OVERRIDE_GFX_VERSION=10.3.0.
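
That variable is just exported in the shell before launching; a minimal example (launch flag as used below):

export HSA_OVERRIDE_GFX_VERSION=10.3.0
./webui.sh --use-rocm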

By itself it says "nVidia CUDA toolkit detected" despite there being no nVidia and no CUDA packages installed (though I used to have an nVidia card).

All the ROCm packages are installed.

I tried to force it using flags; sometimes there are random errors, xformers is removed and other times installed again, but either way it never uses GPU acceleration.

I added rembg and xformers to requirements.txt, thinking that would help. rembg helped (there was a crash when it couldn't be found); xformers probably confused something.

/opt/sdnext (git)-[master] % ./webui.sh --experimental --reinstall --use-rocm
Create and activate python venv
Launching launch.py...
00:41:12-951991 INFO     Running extension preloading                                                                 
00:41:12-956514 INFO     Starting SD.Next                                                                             
00:41:12-957464 INFO     Python 3.11.3 on Linux                                                                       
00:41:12-967243 INFO     Version: 5f2bdba8 Fri Jun 2 12:56:44 2023 -0400                                              
00:41:13-231622 INFO     Setting environment tuning                                                                   
00:41:13-232921 INFO     Forcing reinstall of all packages                                                            
00:41:13-233952 INFO     AMD ROCm toolkit detected                                                                    
00:41:13-234645 INFO     Installing package: torch==2.0.0 torchvision==0.15.1 --index-url                             
                         https://download.pytorch.org/whl/rocm5.4.2                                                   
00:41:14-672970 ERROR    Error running pip: install --upgrade torch==2.0.0 torchvision==0.15.1 --index-url            
                         https://download.pytorch.org/whl/rocm5.4.2                                                   
00:41:15-820805 INFO     Torch 2.0.1+cu118                                                                            
00:41:15-914591 INFO     Installing package: tensorflow==2.12.0                                                       
00:41:19-090670 INFO     Verifying requirements                                                                       
00:41:19-093038 INFO     Installing package: addict                                                                   
00:41:21-546644 INFO     Installing package: aenum                                                                    
00:41:23-982640 INFO     Installing package: aiohttp            
...

Now some things are uninstalled:

% ./webui.sh --experimental --use-rocm
Create and activate python venv
Launching launch.py...
00:45:12-052053 INFO     Running extension preloading                                                                 
00:45:12-056847 INFO     Starting SD.Next                                                                             
00:45:12-057842 INFO     Python 3.11.3 on Linux                                                                       
00:45:12-067872 INFO     Version: 5f2bdba8 Fri Jun 2 12:56:44 2023 -0400                                              
00:45:12-342813 INFO     Setting environment tuning                                                                   
00:45:12-346087 INFO     AMD ROCm toolkit detected                                                                    
00:45:13-516102 INFO     Torch 2.0.1+cu118                                                                            
00:45:13-597207 WARNING  Not used, uninstalling: xformers 0.0.20                                                      
00:45:13-598729 INFO     Installing package: un xformers --yes --quiet                                                
00:45:14-307632 INFO     Verifying requirements                                                                       
00:45:14-344259 WARNING  Package wrong version: numpy 1.24.3 required 1.23.5                                          
00:45:14-345265 INFO     Installing package: numpy==1.23.5 

Everything looks happy, but the GPU is not detected.
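
A quick check of what torch actually sees, run from inside the venv (torch.version.hip is None on non-ROCm builds):

python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"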

rocminfo:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 5 1600 Six-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 1600 Six-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3200                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32795612(0x1f46bdc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32795612(0x1f46bdc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32795612(0x1f46bdc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6400                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
    L3:                      16384(0x4000) KB                   
  Chip ID:                 29759(0x743f)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2320                               
  BDFID:                   2560                               
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4177920(0x3fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***    

Any ideas what else to try?

Version Platform Description

Arch Linux, this tool freshly cloned and installed today

5f2bdba818d7307d98e46c70ef0dc185680b736b

vladmandic commented 1 year ago

> By itself it says nVidia CUDA toolkit detected despite there being no nVidia and no CUDA packages (but I used to have an nVidia card).

you still have old nvidia utilities installed, for example nvidia-smi

> I tried to force it using flags, sometimes there are random errors, xformers is removed and other times installed again,

if xformers is not selected as the desired cross-attention method, it will be uninstalled. i wrote the reason for that in the latest update notes.

> I added rembg and xformers to requirements.txt, thinking that will help

don't. if you want to force a specific xformers version, there are correct ways of doing that; editing requirements.txt is not the way. and rembg is handled automatically.

> Everything looks happy, but the GPU is not detected.

it's not. you can see that torch with cuda is installed instead of torch for rocm. that's because on the initial install it detected nvidia and you only added --use-rocm later. but the installer will not force-change torch once it's installed; you need to use --reinstall.
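
a quick way to confirm which build is active from inside the venv: a +cu118 suffix means the cuda build, a +rocm suffix means the rocm build:

python -c "import torch; print(torch.__version__)"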

so to summarize: remove the leftover nvidia utilities, revert your requirements.txt changes, and run once with --use-rocm --reinstall so torch gets reinstalled for rocm.

MightyPork commented 1 year ago

Thanks for the assistance. I purged everything nvidia, then had to change the torch install command to:

torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/rocm5.4.2
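
For reference, the equivalent manual install from inside the venv (assuming the default venv directory) is roughly:

source venv/bin/activate
pip install --upgrade torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/rocm5.4.2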

now I'm at OutOfMemoryError: HIP out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.98 GiB total capacity; 3.83 GiB already allocated; 66.00 MiB free; 3.92 GiB reserved in total by PyTorch).

I found advice to set export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512, but that didn't help at all :(

It's unable to load any model, even the basic SD one. Easy Diffusion loads the exact same model fine, so it must be some wrong PyTorch settings (?) here.

This is a log from ED in case there are any hints about what to do - "VRAM Optimizations" looks interesting, but I didn't find what it really means or does.

DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
01:55:42.079 INFO cuda:0 Created a temporary directory at /tmp/tmp3hkirdho                                                                                                  instantiator.py:21
01:55:42.080 INFO cuda:0 Writing /tmp/tmp3hkirdho/_remote_module_non_scriptable.py                                                                                          instantiator.py:76
01:55:48.529 INFO cuda:0 VRAM Optimizations: {'KEEP_ENTIRE_MODEL_IN_CPU', 'SET_ATTENTION_STEP_TO_16'}                                                                      optimizations.py:26
01:55:48.933 INFO cuda:0 Global seed set to 42                                                                                                                                      seed.py:65
Sampling:   0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s]01:55:50.396 INFO cuda:0 seeds used = [42]                                                                                                                                  sampler_main.py:64
Data shape for PLMS sampling is (1, 4, 8, 8)
Running PLMS Sampling with 1 timesteps                                           

The implementation for these optimizations seems to reside in sdkit/models/model_loader/stable_diffusion/optimizations.py
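
The "keep entire model in CPU" trick apparently boils down to moving a sub-module onto the GPU only for its forward pass and back off afterwards; a rough PyTorch sketch of the idea (not sdkit's actual code, the function name is made up):

import torch

def run_offloaded(module, *args):
    module.to("cuda")          # ROCm builds of torch expose HIP devices as "cuda"
    with torch.no_grad():
        out = module(*args)
    module.to("cpu")           # move it back off to free VRAM for the next sub-module
    torch.cuda.empty_cache()
    return out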

vladmandic commented 1 year ago

you have only 4GB, so you definitely need the command line flag --medvram or even --lowvram. i'm closing this as the original issue is resolved, but feel free to post further questions.
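
Roughly speaking, both flags trade speed for VRAM by keeping model parts in system RAM and moving them to the GPU only when needed. In diffusers terms (an analogy, not SD.Next's implementation; the model id and prompt are placeholders) the equivalent knobs look like:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# closer to --medvram: move whole sub-models (text encoder, unet, vae)
# to the GPU one at a time (requires the accelerate package)
pipe.enable_model_cpu_offload()
# closer to --lowvram: offload layer by layer, slowest but smallest VRAM footprint
# pipe.enable_sequential_cpu_offload()
image = pipe("a steam locomotive in winter", num_inference_steps=20).images[0]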

MightyPork commented 1 year ago

Yes, thanks for the tip. With medvram it didn't crash at startup, but initializing never moved past zero percent and kept eating all RAM until the system locked up.

lowvram has similar performance to what I previously saw on this GPU.

A test render took about 2 minutes, but it works.

(attached image: 00013-2884050668-dark green steam locomotive, twisting railway, mountains, birch trees, winter, snow)