oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

AMD thread #3759

Open oobabooga opened 1 year ago

oobabooga commented 1 year ago

This thread is dedicated to discussing the setup of the webui on AMD GPUs.

You are welcome to ask questions as well as share your experiences, tips, and insights to make the process easier for all AMD users.

LasonHistory commented 1 year ago

@lufixSch

I would try exllamaV2 (or exllama)

exllama works fine.

Did you make sure, it ran on GPU?

Yes, I set n_gpu_layers and it works fine.

You can get a list of all packages with pip list

pip_list.txt

I don't understand: apart from the three steps in the last comment, what else has to be done?
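One small addition to the pip list suggestion above: filtering the output makes it easier to share just the packages relevant for GPU inference (a sketch; exact package names vary between installs):

# run inside the webui's conda/venv environment
pip list | grep -iE "torch|exllama|llama|gptq|rocm"
# the torch entry should mention a ROCm build (e.g. 2.x.x+rocm5.6);
# otherwise the CUDA/CPU wheels were installed.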

userbox020 commented 1 year ago

Hello guys, I finally got my RX 6800 detected by ooba. Going to share the steps I did before installing ooba. I'm using Ubuntu 22.04.

------------------------------ UNINSTALL PAST ROCM

dpkg -l | grep rocm
# check that all of them are covered by the following command

sudo apt purge rocm-clang-ocl rocm-cmake rocm-core rocm-dbgapi rocm-debug-agent rocm-dev rocm-device-libs rocm-dkms rocm-gdb rocm-libs rocm-llvm rocm-ocl-icd rocm-opencl rocm-opencl-dev rocm-smi-lib rocm-utils rocminfo

sudo apt autoremove

sudo apt update

--------------------------- INSTALL ROCM

sudo apt update -y && sudo apt upgrade -y 
sudo apt-add-repository -y -s -s
sudo apt install -y "linux-headers-$(uname -r)" \
    "linux-modules-extra-$(uname -r)"   

sudo mkdir --parents --mode=0755 /etc/apt/keyrings

wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/5.6/ ubuntu jammy main' \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list

sudo apt update -y 

sudo apt install -y amdgpu-dkms

echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/5.6 jammy main" \
    | sudo tee --append /etc/apt/sources.list.d/rocm.list

echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
    | sudo tee /etc/apt/preferences.d/rocm-pin-600

sudo apt update -y
sudo apt install -y rocm-dev rocm-libs rocm-hip-sdk rocm-dkms
sudo apt install -y rocm-opencl rocm-opencl-dev
sudo apt install -y hipsparse hipblas hipblas-dev hipcub
sudo apt install -y rocblas rocblas-dev rccl rocthrust roctracer-dev 

# COPY AND RUN THE FOLLOWING ALL TOGETHER
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig

# update path
echo "PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH" >> ~/.profile
sudo /opt/rocm/bin/rocminfo | grep gfx
sudo adduser `whoami` video
sudo adduser `whoami` render

# git and git-lfs (large file support)
sudo apt install -y git git-lfs
# development tool may be required later...
sudo apt install -y libstdc++-12-dev
# stable diffusion likes TCMalloc...
sudo apt install -y libtcmalloc-minimal4

sudo apt install -y nvtop 
sudo apt install -y radeontop rovclock
sudo reboot
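Not part of the original steps, but a quick sanity check after the reboot can save time (a sketch; the torch check assumes the ROCm build of PyTorch is already installed in the environment you use for the webui):

# is the GPU visible to ROCm?
rocminfo | grep -E "Name|gfx"
rocm-smi

# is PyTorch built against HIP and able to see the card?
python -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"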

Now I can use my AMD GPU, but only with the llama.cpp loader and GGUF models. When I try to load a GPTQ model with any supported loader, I get the following error:

2023-10-16 02:22:17 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/modules/ui_model_menu.py", line 201, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/modules/models.py", line 320, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/modules/AutoGPTQ_loader.py", line 57, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/installer_files/env/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/installer_files/env/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 875, in from_quantized
    accelerate.utils.modeling.load_checkpoint_in_model(
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/installer_files/env/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1414, in load_checkpoint_in_model
    set_module_tensor_to_device(
  File "/media/10TB_HHD/_OOBAGOOBA-AMD/text-generation-webui/installer_files/env/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 291, in set_module_tensor_to_device
    value = value.to(old_value.dtype)
RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Any idea what I should do? Thanks in advance.

userbox020 commented 1 year ago

Can we make an AMD channel on the Discord server, please? @oobabooga

userbox020 commented 1 year ago

I think this is why AutoGPTQ is not working on ROCm 5.6:

https://github.com/PanQiWei/AutoGPTQ/commit/3de7fbb0d53ccc4516910a7a4000d526c6289d2a
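For reference, AutoGPTQ has to be installed against the matching ROCm version; roughly along these lines (the wheel index URL is the one mentioned later in this thread, and ROCM_VERSION as a build variable is an assumption based on AutoGPTQ's docs at the time, so treat both as unverified):

# prebuilt ROCm wheel (pick the folder matching your ROCm version)
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm571/

# or build from source against the local ROCm install
git clone https://github.com/PanQiWei/AutoGPTQ && cd AutoGPTQ
ROCM_VERSION=5.6 pip install -v .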

fractal-fumbler commented 1 year ago

hello

was anybody successful at compiling https://github.com/ROCmSoftwarePlatform/flash-attention?

lufixSch commented 1 year ago

@fractal-fumbler I haven't tried because the last time I checked they did not yet support flash-attention 2.

There was an open PR for flash-attention 2 but I can't find it (maybe ROCmSoftwarePlatform/flash-attention#14).

gel-crabs commented 1 year ago

@fractal-fumbler I haven't tried because the last time I checked they did not yet support flash-attention 2.

There was an open PR for flash-attention 2 but I can't find it (maybe ROCmSoftwarePlatform/flash-attention#14).

That's the one. I use A1111, but that one does work, albeit at slower speed. It is actively being developed; ROCm's PyTorch repo also has some branches in active development that add Flash Attention v2 support (they don't build yet).

acbp commented 12 months ago

Hey guys, following the guide on Linux (Manjaro Plasma 22, kernel 6.5), my AMD GPU (RX 480) worked without much trouble. I'm so happy! I've been struggling with sd-web-ui until now... I'll review my steps and post them later.

acbp commented 11 months ago

my system conf:

# 1 step - kernel
command: `uname -a`
> Linux lu 6.5.5-1-MANJARO #1 SMP PREEMPT_DYNAMIC Sat Sep 23 12:48:15 UTC 2023 x86_64 GNU/Linux

# 2 step - AMD ROCm Info
command: `rocminfo`
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 3600 6-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 3600 6-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU

*******
Agent 2
*******
  Name:                    gfx803
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 480 Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU

# 3 step - export path to ROCm
command: `export ROCBLAS_TENSILE_LIBPATH=/opt/rocm/lib/rocblas/library/`
> to verify `echo $ROCBLAS_TENSILE_LIBPATH`
>> /opt/rocm/lib/rocblas/library/

the steps:

git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
python -m venv venv
source venv/bin/activate
./start_linux.sh

Note:

1 - in llama.cpp, enable NUMA and load layers to the GPU
2 - verify that the GPU is being used with "amdgpu_top" in another terminal (it should already be installed if you followed the AMD guide); see the sketch below
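A minimal way to do that second check while the model loads (amdgpu_top comes from the AMD guide mentioned above; rocm-smi ships with ROCm):

# in a second terminal, either of these should show VRAM filling up
amdgpu_top
watch -n 1 rocm-smi --showuse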

lufixSch commented 11 months ago

Hey did anyone of you try TheBloke/deepseek-coder-6.7B-instruct-GGUF yet?

Every time I try to load it, it crashes with:

ERROR: byte not found in vocab: '
'
./webui.sh: line 33: 14936 Segmentation fault      (core dumped) python server.py $STARTUP_OPTIONS
lewis100 commented 11 months ago

I'm trying to use Oobabooga on my RX 6750 XT, which means I have to use Linux for the first time in my life. After installing it, I've followed every tutorial I've seen about ROCm and Oobabooga on AMD, but in the end I can't download a model, getting this error:

Traceback (most recent call last):
  File "/home/lewis/text-generation-webui/modules/ui_model_menu.py", line 239, in download_model_wrapper
    model, branch = downloader.sanitize_model_and_branch_names(repo_id, None)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lewis/text-generation-webui/download-model.py", line 39, in sanitize_model_and_branch_names
    if model[-1] == '/':
       ~^^^^
IndexError: string index out of range

And even if I try to use a downloaded model, it doesn't load, giving this error in the terminal:

UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: llama or set allow_custom_value=True.
  warnings.warn(
2023-11-19 12:57:17 INFO:Loading LLaMA2-13B-Tiefighter.Q8_0.gguf...
2023-11-19 12:57:17 INFO:llama.cpp weights detected: models/LLaMA2-13B-Tiefighter.Q8_0.gguf
2023-11-19 12:57:17 INFO:Cache capacity is 0 bytes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon RX 6750 XT, compute capability 10.3
  Device 1: AMD Radeon Graphics, compute capability 10.3

Additionally, when I open Oobabooga, it shows this warning in the terminal:

UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "

I can't understand a single thing, so please explain it as if I'm five years old.

THANKS IN ADVANCE!

ghost commented 11 months ago

"UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable."

The warning is normal; the provided bitsandbytes version is currently not compatible with ROCm. AFAIK there is a version for ROCm, but it's currently not used.

lufixSch commented 11 months ago

@lewis100 Hi, when you say you followed every tutorial, what did you do? Currently the WebUI should work on AMD without any special configuration. Did you use the start_linux.sh script?

Which Linux Distribution are you using?

I'm pretty sure the downloading problem is a formatting problem. What are you entering into the text fields?

The bitsandbytes warning doesn't need to concern you as long as you use GGUF, GPTQ or AWQ models.

The problem with loading the model is a bit harder. Is that everything you get as output? Does the WebUI still work, or does the process exit so that you have to start it again? If we are lucky, this is just an issue with full VRAM; the default configuration for llama.cpp is way too high for the 6750 XT (a quick VRAM check is sketched below). Look at the last line of the error in the terminal where you started the WebUI. Does it say something like [Segmentation Fault] and Full Memory?

PS: you can use Markdown formatting to make your message more readable: Cheat Sheet. For example, write error messages in code blocks.
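If you want to see whether VRAM really is the limit, a quick look before and while loading helps (a sketch; rocm-smi is part of the ROCm install):

# total vs. used VRAM as seen by ROCm; compare against the model size plus context cache
rocm-smi --showmeminfo vram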

lewis100 commented 11 months ago

I've been trying for more than 8 hours, so I've installed a lot of stuff in the Linux terminal out of frustration. I'm using start_linux.sh and Ubuntu 22.04.3 LTS. I've loaded codellama-7b.Q4_0.gguf using the CPU and it worked! So I think the problem is with using the GPU. It is recognizing it, because the last thing it says in the terminal is:

ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon RX 6750 XT, compute capability 10.3
  Device 1: AMD Radeon Graphics, compute capability 10.3

Does that mean that I have to somehow select which one I should use (integrated vs. dedicated)? I didn't see [Segmentation Fault] or Full Memory anywhere, and I've been trying out different configs in llama.cpp, so I suppose it's not a VRAM problem.

Edit: I'm curious about the "Cache capacity is 0 bytes" message. Is that normal?

lufixSch commented 11 months ago

Okay. Yeah installing a lot of things without knowing what they do can lead to a lot of problems. I too had to reinstall my OS once because I broke the GPU driver and was unable to fix it.

I just checked my output when loading a GGUF model. I get the following output for my RX 6750 XT:

2023-11-19 20:22:14 INFO:Loading settings from settings.yaml...
2023-11-19 20:22:14 INFO:Loading openhermes-2.5-mistral-7b.Q6_K.gguf...
2023-11-19 20:22:14 INFO:llama.cpp weights detected: models/openhermes-2.5-mistral-7b.Q6_K.gguf
2023-11-19 20:22:14 INFO:Cache capacity is 0 bytes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 10.3

I think the problem lies with the second detected ROCm device (Device 0). Are you running an AMD CPU with an integrated GPU? From my output, the 6750 XT should be recognized as AMD Radeon Graphics. It seems like your PC is detecting two of the same GPUs. I am not really sure how to fix that. What does rocminfo give as output? If you are running Ubuntu only for this, the easiest way might be to start again with a fresh install of Ubuntu.

Usually you should only need to install ROCm (you can check whether an installation already exists with rocminfo) and run the start_linux.sh script.
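In other words, on a working driver stack the whole setup can be as small as this (a sketch of the happy path; it assumes the one-click installer asks which GPU vendor to set up for):

# check that an existing ROCm install already sees the card
rocminfo | grep -E "Marketing Name|gfx"

# then let the one-click installer handle the Python environment
./start_linux.sh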

lewis100 commented 11 months ago

Yes, I think maybe the problem is with the integrated GPU on my Ryzen 5600G as well, but it could be with the ROCm installation. Here is the info (I have no idea what Agent 3 is):


ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 5 5600G with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 5600G with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3900                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    15710808(0xefba58) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15710808(0xefba58) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    15710808(0xefba58) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6750 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      3072(0xc00) KB                     
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29663(0x73df)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2880                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            40                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    12566528(0xbfc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:                                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5688(0x1638)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1900                               
  BDFID:                   2560                               
  Internal Node ID:        2                                  
  Compute Unit:            7                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***    
lufixSch commented 11 months ago

Yes, I am pretty sure the third one should not be there. But the only thing I can suggest is a clean reinstall.

Additionally: I just looked up how to install ROCm on Ubuntu and the official ROCm documentation has a banner on the top of its page which says the following:

ROCm currently doesn’t support integrated graphics. Should your system have an AMD IGP installed, disable it in the BIOS prior to using ROCm. If the driver can enumerate the IGP, the ROCm runtime may crash the system, even if told to omit it via HIP_VISIBLE_DEVICES.

This may be the original reason why you faced issues during installation.

userbox020 commented 11 months ago

Hmm, I still can't run any AWQ or GPTQ models on my GPU, only GGUF versions. I'm using Ubuntu and the following ooba version:

git show ae8cd449ae3e0236ecb3775892bb1eea23f9ed68
git describe --tags
snapshot-2023-10-15-12-gae8cd44
oobabooga commented 11 months ago

AWQ is CUDA-only afaik, but AutoGPTQ, exllama (v1 at least), and as a last resort GPTQ-for-LLaMa should all work. Do you have ROCm 5.6 installed?
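A quick way to confirm which ROCm version is actually installed (a sketch; paths assume the standard /opt/rocm layout, and not every file exists on every install):

cat /opt/rocm/.info/version      # e.g. 5.6.0-...
dpkg -l | grep -i rocm-libs      # Debian/Ubuntu package view
hipconfig --version              # as reported by the HIP toolchain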

userbox020 commented 11 months ago

Sup bro, I have ROCm 5.6 installed, check here: https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-1763973978

This week I'm going to do a fresh install of ooba and check whether other models and the Transformers loader work with AMD.

lufixSch commented 11 months ago

ExllamaV2 has ROCm support out of the box. It should work too.

smCloudInTheSky commented 11 months ago

Hello! Has anyone been able to run this project using Docker? I'm trying to adapt the Dockerfile by using rocm/pytorch instead of the nvidia/cuda:12.1.0-devel-ubuntu22.04 base image, but I end up with the following error:

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
    /app/venv/lib/python3.9/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
      device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
    Traceback (most recent call last):
      File "", line 2, in
      File "", line 34, in
      File "/tmp/pip-install-8rjc8h2h/xformers_91967fbb8fb14ba2a856b79d3ff3493e/setup.py", line 239, in
        ext_modules=get_extensions(),
      File "/tmp/pip-install-8rjc8h2h/xformers_91967fbb8fb14ba2a856b79d3ff3493e/setup.py", line 157, in get_extensions
        raise RuntimeError(
    RuntimeError: CUTLASS submodule not found. Did you forget to run git submodule update --init --recursive ?
    [end of output]

CUTLASS is for NVIDIA, right? Shouldn't pip install -r requirements_amd.txt be sufficient to dodge this issue?

containerblaq1 commented 11 months ago

Hey there, small thing to add. If you encounter llama.cpp using the wrong GPU as your main device, this environment variable worked for re-ordering the devices for me before running text-generation-webui:

export CUDA_VISIBLE_DEVICES=1,0

I'm not sure if it'll work to exclude a device from use but will test later.
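For reference, the report above boils down to masking or re-ordering devices through environment variables before launching the webui (a sketch; the indices refer to the "found N ROCm devices" list llama.cpp prints at load time):

# make the second listed device the main one, keep both visible
export CUDA_VISIBLE_DEVICES=1,0

# or hide everything except one device (HIP honours HIP_VISIBLE_DEVICES as well)
export HIP_VISIBLE_DEVICES=0

./start_linux.sh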

lufixSch commented 11 months ago

@smCloudInTheSky I never tried.

You might need to adjust a lot of things because the Dockerfile looks quite old. You should in any case use the requirements_amd.txt, otherwise it won't work at all. I would also recommend removing the GPTQ-for-LLaMa part (for now), as it hasn't worked with AMD for some time.

Maybe you can open a pull request adding an AMD-compatible Dockerfile. I would be happy to help build (or at least test) it.

lufixSch commented 11 months ago

@containerblaq1 Thanks, that's really helpful. I will be getting a second GPU in the next few days and was already searching for something like that.

@lewis100 Maybe this also helps with your problem of multiple detected GPUs

smCloudInTheSky commented 11 months ago

@lufixSch In the end I found something that built! Skipping GPTQ and using requirements_amd worked :+1: However, when trying to run the Docker image I end up with this:

docker run text-generation-webui-text-generation-webui
/app/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
2023-11-21 07:55:00 INFO:Loading the extension "gallery"...
bin /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
Traceback (most recent call last):
  File "/app/server.py", line 236, in <module>
    create_interface()
  File "/app/server.py", line 117, in create_interface
    ui_parameters.create_ui(shared.settings['preset'])  # Parameters tab
  File "/app/modules/ui_parameters.py", line 11, in create_ui
    generate_params = presets.load_preset(default_preset)
  File "/app/modules/presets.py", line 49, in load_preset
    with open(Path(f'presets/{name}.yaml'), 'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'presets/simple-1.yaml'

Even though my docker-compose is the one from the project, it seems it didn't copy the folders properly (see the note after the compose file below).

cat docker-compose.yml
version: "3.3"
services:
  text-generation-webui:
    build:
      context: .
      args:
        # specify which cuda version your card supports: https://developer.nvidia.com/cuda-gpus
        TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-7.5}
        WEBUI_VERSION: ${WEBUI_VERSION:-HEAD}
    env_file: .env
    ports:
      - "${HOST_PORT:-7860}:${CONTAINER_PORT:-7860}"
      - "${HOST_API_PORT:-5000}:${CONTAINER_API_PORT:-5000}"
    stdin_open: true
    tty: true
    volumes:
      - ./characters:/app/characters
      - ./extensions:/app/extensions
      - ./loras:/app/loras
      - ./models:/app/models
      - ./presets:/app/presets
      - ./prompts:/app/prompts
      - ./softprompts:/app/softprompts
      - ./training:/app/training
      - ./cloudflared:/etc/cloudflared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
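One likely explanation for the missing presets/ folder: the bind mounts above are defined in docker-compose.yml, but a bare docker run <image> (as in the command further up) ignores the compose file, so none of those volumes are applied inside the container. Starting through compose should apply them (a sketch; service name taken from the file above):

# build and start via compose so the ./presets, ./models, ... bind mounts exist
docker compose up --build text-generation-webui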

Here is my current Dockerfile, for the curious. Once it works, I plan to write a script and propose using either mine or the original Dockerfile depending on the detected hardware (for Linux at least).

cat Dockerfile
FROM rocm/dev-ubuntu-22.04:latest

LABEL maintainer="Your Name <your.email@example.com>"
LABEL description="Docker image for GPTQ-for-LLaMa and Text Generation WebUI"

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,rw apt-get update && \
    apt-get install --no-install-recommends -y python3-dev libportaudio2 libasound-dev git python3 python3-pip make g++ ffmpeg && \
    rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/pip,rw pip3 install virtualenv

RUN mkdir /app

WORKDIR /app

ARG WEBUI_VERSION
RUN test -n "${WEBUI_VERSION}" && git reset --hard ${WEBUI_VERSION} || echo "Using provided webui source"

# Create virtualenv
RUN virtualenv /app/venv
RUN --mount=type=cache,target=/root/.cache/pip,rw \
    . /app/venv/bin/activate && \
    python3 -m pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm571/ && \
    python3 -m pip install --upgrade pip setuptools wheel ninja && \
    python3 -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7 

# Install main requirements
COPY requirements_amd.txt /app/requirements_amd.txt
RUN --mount=type=cache,target=/root/.cache/pip,rw \
    . /app/venv/bin/activate && \
    python3 -m pip install -r requirements_amd.txt

COPY . /app/

RUN cp /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so

# Install extension requirements
RUN --mount=type=cache,target=/root/.cache/pip,rw \
    . /app/venv/bin/activate && \
    for ext in /app/extensions/*/requirements.txt; do \
    cd "$(dirname "$ext")"; \
    python3 -m pip install -r requirements.txt; \
    done

ENV CLI_ARGS=""

EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
CMD . /app/venv/bin/activate && python3 server.py ${CLI_ARGS}
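Also worth noting: the compose file still reserves an NVIDIA driver device, which does nothing for ROCm. An AMD container needs the KFD and DRI device nodes plus the video group; with plain docker run that would look roughly like this (an assumption based on how ROCm containers are normally launched, not something tested in this thread):

# pass the ROCm device nodes into the container
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -p 7860:7860 \
  text-generation-webui-text-generation-webui
# on some distros the numeric gid of the host's "render" group must be added as well
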
lewis100 commented 11 months ago

How do I use "export CUDA_VISIBLE_DEVICES=1,0"? I'm a total newbie at this.

I've reinstalled Ubuntu 22.04.3 LTS and did some things differently. Installing Oobabooga through Pinokio helped a lot, and it's not replying with an error anymore, but it's not using the GPU to offload.

I've even tried Kobold, but it doesn't recognize the GPU either. I'm starting to feel that it's impossible to run a model with an RX 6750 XT.

Has anyone ever done that?

lewis100 commented 11 months ago

Never mind. I've randomly opened one of the Kobolds I've got lying around (koboldcpp_nocuda) and it totally worked on Windows with CLBlast. I'm so tired that I'll give up on Oobabooga for now and stick with Kobold.

lufixSch commented 11 months ago

but it's not using the GPU to offload

This is the default behaviour of llama.cpp. You need to increase the GPU offload slider in the UI before loading the model.
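The same thing can also be set from the command line instead of the slider (a sketch, assuming the usual server.py flags for the llama.cpp loader):

# offload e.g. 35 layers to the GPU when loading a GGUF model
python server.py --loader llama.cpp --model LLaMA2-13B-Tiefighter.Q8_0.gguf --n-gpu-layers 35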

Has anyone ever done that?

Yes, I'm running an RX 6750 XT. But I run Manjaro as my OS, that's why I can't really help you with the driver setup on Ubuntu.

containerblaq1 commented 11 months ago

How do I use "export CUDA_VISIBLE_DEVICES=1,0"? I'm a total newbie at this.

You'll enter this in your terminal. From your output above, you would probably want to use export CUDA_VISIBLE_DEVICES=1

I've reinstalled Ubuntu 22.04.3 LTS and did some things differently. Installing Oobabooga through Pinokio helped a lot, and it's not replying with an error anymore, but it's not using the GPU to offload.

Try increasing the gpu_layers slider in the UI.

I've even tried Kobold, but it doesn't recognize the GPU either. I'm starting to feel that it's impossible to run a model with an RX 6750 XT.

It should work fine. If you're looking to give it another go, check out the Discord!

Has anyone ever done that?

I've used it on 6800XT, 7900XTX, 7900XT. Works well.

lewis100 commented 11 months ago

I've tried that in the terminal and it didn't work. I'm using the slider as well. I didn't know about the Discord. I'll check it out.

lufixSch commented 11 months ago

I've used it on 6800XT, 7900XTX, 7900XT. Works well.

@containerblaq1 Is there anything special I should know about setting up the 7900XTX or running the gui with multiple GPUs? I just got a 7900XT and want to add it to my system.

lewis100 commented 11 months ago

I've tried to make it work one last time, and it turns out it wasn't offloading to the GPU because the following command wasn't executed properly:

pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/rocm5.2

I had to use a previous version of Python for it to work (Python 3.8). I hope this info helps someone. It's finally offloading, but now I have a new error to deal with:

CUDA error 98 at /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml-cuda.cu:6951: invalid device function
current device: 0
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2904: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

It happens after my first prompt. I've tried export CUDA_VISIBLE_DEVICES=1 and HIP_VISIBLE_DEVICES=1, but no luck.

containerblaq1 commented 11 months ago

I've used it on 6800XT, 7900XTX, 7900XT. Works well.

@containerblaq1 Is there anything special I should know about setting up the 7900XTX or running the gui with multiple GPUs? I just got a 7900XT and want to add it to my system.

I believe I just had to make hip again.

@lewis100

Try:

CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1032" CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ pip install llama_cpp_python --force-reinstall --no-cache-dir

Edit:

The above command came from this comment:

https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-1735247373

lewis100 commented 11 months ago

I receive the following error when I try that:


Downloading typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Building wheels for collected packages: llama_cpp_python
  Building wheel for llama_cpp_python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama_cpp_python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [48 lines of output]
      *** scikit-build-core 0.6.1 using CMake 3.27.7 (wheel)
      *** Configuring CMake...
      2023-11-23 11:01:01,922 - scikit_build_core - WARNING - libdir/ldlibrary: /home/lewis/miniconda3/lib/libpython3.11.a is not a real file!
      2023-11-23 11:01:01,922 - scikit_build_core - WARNING - Can't find a Python library, got libdir=/home/lewis/miniconda3/lib, ldlibrary=libpython3.11.a, multiarch=x86_64-linux-gnu, masd=None
      loading initial cache file /tmp/tmp0rq5j8qe/build/CMakeInit.txt
      -- The C compiler identification is Clang 17.0.0
      -- The CXX compiler identification is Clang 17.0.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /opt/rocm/llvm/bin/clang - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - failed
      -- Check for working CXX compiler: /opt/rocm/llvm/bin/clang++
      -- Check for working CXX compiler: /opt/rocm/llvm/bin/clang++ - broken
      CMake Error at /tmp/pip-build-env-ag82x3sc/normal/lib/python3.11/site-packages/cmake/data/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
        The C++ compiler

          "/opt/rocm/llvm/bin/clang++"

        is not able to compile a simple test program.

        It fails with the following output:

          Change Dir: '/tmp/tmp0rq5j8qe/build/CMakeFiles/CMakeScratch/TryCompile-9Jqhb1'

          Run Build Command(s): /tmp/pip-build-env-ag82x3sc/normal/lib/python3.11/site-packages/ninja/data/bin/ninja -v cmTC_665af
          [1/2] /opt/rocm/llvm/bin/clang++    -MD -MT CMakeFiles/cmTC_665af.dir/testCXXCompiler.cxx.o -MF CMakeFiles/cmTC_665af.dir/testCXXCompiler.cxx.o.d -o CMakeFiles/cmTC_665af.dir/testCXXCompiler.cxx.o -c /tmp/tmp0rq5j8qe/build/CMakeFiles/CMakeScratch/TryCompile-9Jqhb1/testCXXCompiler.cxx
          [2/2] : && /opt/rocm/llvm/bin/clang++   CMakeFiles/cmTC_665af.dir/testCXXCompiler.cxx.o -o cmTC_665af   && :
          FAILED: cmTC_665af
          : && /opt/rocm/llvm/bin/clang++   CMakeFiles/cmTC_665af.dir/testCXXCompiler.cxx.o -o cmTC_665af   && :
          ld.lld: error: unable to find library -lstdc++
          clang++: error: linker command failed with exit code 1 (use -v to see invocation)
          ninja: build stopped: subcommand failed.

        CMake will not be able to correctly generate this project.
      Call Stack (most recent call first):
        CMakeLists.txt:3 (project)

      -- Configuring incomplete, errors occurred!

      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama_cpp_python
Failed to build llama_cpp_python
ERROR: Could not build wheels for llama_cpp_python, which is required to install pyproject.toml-based projects
lufixSch commented 11 months ago

From the Wiki:

Requires ROCm SDK 5.4.2 or 5.4.3 to be installed. Some systems may also need:

sudo apt-get install libstdc++-12-dev

lufixSch commented 11 months ago

I believe I just had to make hip again

@containerblaq1 What do you mean by that?

Are you able to split Models between (different) GPUs?

containerblaq1 commented 11 months ago

@lufixSch Yup! We briefly tested this a bit ago.

https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-1741872811

Edit:

make hip ROCM_TARGET=gfx1100

lewis100 commented 11 months ago

@containerblaq1

Where should I use

CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1032" CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ pip install llama_cpp_python --force-reinstall --no-cache-dir

Is that in the main terminal? It's my first time using Linux. If so, I've tried it and I keep getting CUDA error 98. My GPU is an RX 6750 XT, so I think it's a gfx1031; I've tried that as well, but I'm still getting the same error.

containerblaq1 commented 11 months ago

@containerblaq1

Where should I use

CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1032" CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ pip install llama_cpp_python --force-reinstall --no-cache-dir

Is that in the main terminal? It's my first time using Linux. If so, I've tried it and I keep getting CUDA error 98. My GPU is an RX 6750 XT, so I think it's a gfx1031; I've tried that as well, but I'm still getting the same error.

The command is run in the terminal so that llama_cpp is built properly.

That GPU seems to be 1032. Please check here:

https://rocm.docs.amd.com/en/latest/release/windows_support.html

Please post the output of pip list in your terminal after making sure you are in the correct Conda environment.

It may be better at this point to recreate your conda environment from scratch

Via Google, use the search "conda destroy environment". FreeCodeCamp has a great tutorial on how to remove the conda environment.

Here is documentation on how to remove ROCm:

https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/uninstall.html

Double Edit: Didn't register the XT on your card. Your ROCm output states gfx1030 -

  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6750 XT  

So use:

CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1030" CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ pip install llama_cpp_python --force-reinstall --no-cache-dir

Please post llama.cpp's output before using the chat. Look for these lines:

llm_load_print_meta: general.name   = oobabooga_codebooga-34b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required  = 22733.87 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/51 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
antspartanelite commented 11 months ago

@lewis100 You said that you had tried

export CUDA_VISIBLE_DEVICES=1 and HIP_VISIBLE_DEVICES=1

Have you tried HIP_VISIBLE_DEVICES=0 ?

Further up in the thread you posted the following:

 Device 0: AMD Radeon RX 6750 XT, compute capability 10.3
 Device 1: AMD Radeon Graphics, compute capability 10.3

From this it looks like your iGPU is device 1 and 6750 XT is device 0. It could be worth a try at least if you haven't already.

lewis100 commented 11 months ago

I've tried HIP_VISIBLE_DEVICES=0 as well and even disabled the iGPU.

It was not even loading the model, but then I duplicated TensileLibrary_lazy_gfx1030 and renamed the copy TensileLibrary_lazy_gfx1031, and it loaded, but with CUDA error 98 after my first prompt. I wonder if I'm managing those environments properly, because I've never used Linux before.
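For what it's worth, the commonly suggested workaround for gfx1031/gfx1032 cards (instead of copying the Tensile library files by hand) is to tell ROCm's user-space libraries to treat the card as gfx1030 via an environment variable; this is an alternative suggestion, not something confirmed in this thread:

# make ROCm treat the RX 6750 XT (gfx1031) as the officially supported gfx1030
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./start_linux.sh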

containerblaq1 commented 11 months ago

@lewis100

Might be better to troubleshoot this in real time using some messenger. CUDA error 98 is the same error I got when llama.cpp was not being built properly, which was fixed in this comment: https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-1735247373

Where is the Name: field read from to be presented in rocminfo? Does rocminfo show gfx1030 for you as well, @lufixSch?

I'm unstabletable0321 on Discord.

containerblaq1 commented 11 months ago

@lufixSch I read up earlier in the chat and was able to run CodeBooga. Still having issues with that one?

lufixSch commented 11 months ago

Does rocminfo show gfx1030 for you as well, @lufixSch ?

I'm not able to verify this right now, but I'm pretty sure it does.

lufixSch commented 11 months ago

Didn't register the XT on your card. Your ROCm output states gfx1030

1030, 1031 and 1032 are basically the same architecture; that's why you can just use 1030.

DocMAX commented 11 months ago

Can integrated graphics be used? In my case, an AMD Ryzen 7 5800H with Radeon Graphics. I get output from rocminfo... http://ix.io/4MBL

lufixSch commented 11 months ago

@DocMAX No because ROCm does not support them. From the ROCm Documentation:

ROCm currently doesn’t support integrated graphics. Should your system have an AMD IGP installed, disable it in the BIOS prior to using ROCm. If the driver can enumerate the IGP, the ROCm runtime may crash the system, even if told to omit it via HIP_VISIBLE_DEVICES

Lvjh1130 commented 11 months ago

Have you considered supporting AMD graphics cards on Windows through pytorch-directml? I am really looking forward to it.

lufixSch commented 11 months ago

Now that I've had some time to set up my new 7900 XTX, here are some of my findings and problems with running a dual GPU setup (6750 XT + 7900 XTX). Maybe it helps some of you, and maybe some of you can help me.

Setup

Even though I already had a 6750 XT, I had to remove the GPU drivers and reinstall them. Otherwise the setup was easy. The text-generation-webui started directly without a need to reinstall anything

What works and what doesn't:

Error Message

```bash
Traceback (most recent call last):
  File "/run/media/lukas/eb89b7d7-87c5-48bf-8c4a-7afc9925c04b/AI/LLM/WebUI/modules/ui_model_menu.py", line 209, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/run/media/lukas/eb89b7d7-87c5-48bf-8c4a-7afc9925c04b/AI/LLM/WebUI/modules/models.py", line 85, in load_model
    output = load_func_map[loader](model_name)
  File "/run/media/lukas/eb89b7d7-87c5-48bf-8c4a-7afc9925c04b/AI/LLM/WebUI/modules/models.py", line 351, in ExLlama_loader
    model, tokenizer = ExllamaModel.from_pretrained(model_name)
  File "/run/media/lukas/eb89b7d7-87c5-48bf-8c4a-7afc9925c04b/AI/LLM/WebUI/modules/exllama.py", line 75, in from_pretrained
    model = ExLlama(config)
  File "/run/media/lukas/eb89b7d7-87c5-48bf-8c4a-7afc9925c04b/AI/LLM/WebUI/installer_files/env/lib/python3.11/site-packages/exllama/model.py", line 867, in __init__
    inv_freq = 1.0 / (self.config.rotary_embedding_base ** (torch.arange(0, self.config.head_dim, 2, device = device).float() / self.config.head_dim))
RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
```

After updating torch (see llama.cpp) the error disappeared and I get similar behavior to exllamaV2.

Additional Information

You cannot run a 6xxx and a 7xxx GPU and split a model between them (maybe AMD will add official support for 6xxx to ROCm in the future, which would probably make this possible). You need to select the right GPU with HIP_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES. Interestingly, for me the IDs were different from the output of rocm-smi. I used DRI_PRIME=<gpu id> glxinfo | grep "OpenGL renderer" to find the right ID for each GPU, but this command might be exclusive to Arch (or even Manjaro).
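A compact version of that ID hunting, as a sketch (glxinfo comes from the mesa-utils/mesa-demos package; the DRI_PRIME trick is the one described above and may be distribution specific):

# how ROCm numbers the cards
rocm-smi --showproductname

# map DRI ids to marketing names
for i in 0 1; do DRI_PRIME=$i glxinfo 2>/dev/null | grep "OpenGL renderer"; done

# then pin the webui to the card you want
HIP_VISIBLE_DEVICES=1 ./start_linux.sh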