oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Models failing to load in Exllama due to 'CUDA error: no kernel image is available for execution on the device' #4384

Closed: kurukurukuru closed this issue 9 months ago

kurukurukuru commented 11 months ago

Describe the bug

Edit: this is using ExLlama. I did some more testing, and loading a model via llama.cpp and offloading to the GPU works as expected.

I am trying to use 2x Tesla K80s, but when loading the model I get the error 'CUDA error: no kernel image is available for execution on the device'. The model does load into VRAM, so it seems to fail at the stage after loading. I am currently using the repo drivers on Ubuntu 20.04 (sudo apt install nvidia-driver-470). The most recent CUDA version this GPU can run is 11.4. I also tried drivers 450 and 460, each with their respective CUDA versions; no luck, same error. As per the NVIDIA documentation (https://docs.nvidia.com/deploy/cuda-compatibility/), things should be compatible between minor versions?

I tried both the manual installation and the one-click installer. No dice. I can see that for GPUs with compute capability <= 3.5, a different Torch package is needed (see https://blog.nelsonliu.me/2020/10/13/newer-pytorch-binaries-for-older-gpus/), but from what I've read, CC 3.7 (K80) support is retained in current Torch versions. I tried that package anyway; it doesn't seem like Torch 1.13 is compatible with the latest release.
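For anyone hitting the same wall: 'no kernel image is available' means the installed binary contains no device code usable by the GPU's compute capability. The logic behind that error can be sketched as follows (the `wheel_archs` list is illustrative, not what any particular PyTorch wheel actually ships; real arch strings come from `torch.cuda.get_arch_list()`):

```python
# A CUDA binary embeds kernels only for the architectures it was compiled
# for. A GPU can execute a kernel image if the binary contains either
# SASS for its exact compute capability ("sm_XY") or PTX for an equal or
# lower capability ("compute_XY"), since PTX is JIT-compiled forward,
# never backward.

def can_run(device_cc, compiled_archs):
    """device_cc: (major, minor) tuple, e.g. (3, 7) for a Tesla K80.
    compiled_archs: arch strings in the style of torch.cuda.get_arch_list()."""
    dev = device_cc[0] * 10 + device_cc[1]
    for arch in compiled_archs:
        kind, num = arch.split("_")
        num = int(num)
        if kind == "sm" and num == dev:       # exact binary (SASS) match
            return True
        if kind == "compute" and num <= dev:  # PTX can be JIT-ed forward
            return True
    return False

# Illustrative arch list resembling a recent wheel (no Kepler support):
wheel_archs = ["sm_50", "sm_60", "sm_70", "sm_80", "sm_86", "compute_86"]

print(can_run((3, 7), wheel_archs))  # K80: no sm_37, no usable PTX -> False
print(can_run((8, 6), wheel_archs))  # RTX 30-series: sm_86 present -> True
```

On a live system, comparing `torch.cuda.get_device_capability(0)` against `torch.cuda.get_arch_list()` answers the same question for the wheel actually installed.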

I'm kinda at a loss here. Not sure what to do next.

Is there an existing issue for this?

Reproduction

- Install the NVIDIA repo drivers with 'sudo apt install nvidia-driver-470'
- Install oobabooga with CUDA 11.8 for Kepler GPUs
- Try to launch a model

Screenshot

N/A

Logs


oobabooga:

2023-10-25 01:34:07 INFO:Loading TheBloke_Xwin-MLewd-13B-v0.2-GPTQ...
2023-10-25 01:34:12 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/home/h-/text-generation-webui/modules/ui_model_menu.py", line 201, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h-/text-generation-webui/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h-/text-generation-webui/modules/models.py", line 342, in ExLlama_HF_loader
    return ExllamaHF.from_pretrained(model_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h-/text-generation-webui/modules/exllama_hf.py", line 174, in from_pretrained
    return ExllamaHF(config)
           ^^^^^^^^^^^^^^^^^
  File "/home/h-/text-generation-webui/modules/exllama_hf.py", line 31, in __init__
    self.ex_model = ExLlama(self.ex_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h-/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllama/model.py", line 903, in __init__
    temp_state = torch.zeros((config.max_input_len, config.intermediate_size), dtype = torch.float16, device = dev)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

NVIDIA-SMI:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:03:00.0 Off |                    0 |
| N/A   60C    P0    61W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   60C    P0    75W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   49C    P0    61W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:10:00.0 Off |                    0 |
| N/A   53C    P0    77W / 149W |      0MiB / 11441MiB |     60%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

conda info:

h-@amanatsu:~/text-generation-webui$ ./cmd_linux.sh 
(/home/h-/text-generation-webui/installer_files/env) h-@amanatsu:~/text-generation-webui$ conda info

     active environment : /home/h-/text-generation-webui/installer_files/env
    active env location : /home/h-/text-generation-webui/installer_files/env
            shell level : 1
       user config file : /home/h-/.condarc
 populated config files : 
          conda version : 23.3.1
    conda-build version : not installed
         python version : 3.10.10.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=11.4=0
                          __glibc=2.31=0
                          __linux=5.4.0=0
                          __unix=0=0
       base environment : /home/h-/text-generation-webui/installer_files/conda  (writable)
      conda av data dir : /home/h-/text-generation-webui/installer_files/conda/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/h-/text-generation-webui/installer_files/conda/pkgs
                          /home/h-/.conda/pkgs
       envs directories : /home/h-/text-generation-webui/installer_files/conda/envs
                          /home/h-/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.28.1 CPython/3.10.10 Linux/5.4.0-165-generic ubuntu/20.04.6 glibc/2.31
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

(/home/h-/text-generation-webui/installer_files/env) h-@amanatsu:~/text-generation-webui$

System Info

Environment:
AMD R5 1600
DDR4 64GB
2x Tesla K80
Ubuntu 20.04.6 (also tried 22.04, no difference)

NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4
kurukurukuru commented 11 months ago

Any ideas?

github-actions[bot] commented 9 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

lugangqi commented 2 months ago

My M40 24GB fails to run ExLlama in the same way, while a 4060 Ti 16GB works fine under CUDA 12.4. It seems the author has not updated the kernels to be compatible with the M40. I also asked for help from the ExLlamaV2 author yesterday; I don't know whether this compatibility problem will be fixed. The M40 has the same architecture as the 980 Ti, with compute capability 5.2, which meets the requirements for CUDA 12.4.
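The difference between the two cards comes down to when NVIDIA dropped each architecture from the CUDA toolkit: Kepler (sm_37, the K80) was removed in CUDA 12.0, while Maxwell (sm_52, the M40 and 980 Ti) is deprecated in CUDA 12.x but not yet removed. A rough sketch of that support check, condensed from NVIDIA's release notes (treat the table as illustrative):

```python
# CUDA toolkit version in which device code generation for each
# architecture family was *removed* (None = still present in CUDA 12.x).
ARCH_REMOVED_IN = {
    "sm_35": (12, 0),  # Kepler
    "sm_37": (12, 0),  # Kepler (Tesla K80)
    "sm_50": None,     # Maxwell, deprecated but not removed
    "sm_52": None,     # Maxwell (Tesla M40, GTX 980 Ti)
}

def can_compile_for(arch, toolkit):
    """True if the given toolkit version can still emit code for arch."""
    removed = ARCH_REMOVED_IN[arch]
    return removed is None or toolkit < removed

print(can_compile_for("sm_37", (12, 4)))  # K80 under CUDA 12.4 -> False
print(can_compile_for("sm_52", (12, 4)))  # M40 under CUDA 12.4 -> True
print(can_compile_for("sm_37", (11, 8)))  # K80 under CUDA 11.8 -> True
```

So a binary built for the M40 under CUDA 12.4 is possible in principle (the loader just has to be compiled with sm_52 in its arch list), whereas the K80 is limited to CUDA 11.x toolchains.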