oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

RuntimeError: CUDA error: no kernel image is available for execution on the device #5799

Open UlvenDagoth opened 7 months ago

UlvenDagoth commented 7 months ago

Describe the bug

I run into this error every time I try to load a model with any of the ExLlama loaders.

(The full traceback is identical to the one under Logs below.)

I have no idea how to fix this, as I have no coding knowledge, and I've looked everywhere for a fix. Please help me with this, as I'm at my wit's end here.

Is there an existing issue for this?

Reproduction

Load any model with the ExLlamaV2_HF or ExLlamaV2 loader; the error occurs every time.

Screenshot

(screenshot of the error attached)

Logs

Traceback (most recent call last):
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\modules\ui_model_menu.py", line 245, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\modules\models.py", line 87, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\modules\models.py", line 380, in ExLlamav2_HF_loader
    return Exllamav2HF.from_pretrained(model_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\modules\exllamav2_hf.py", line 181, in from_pretrained
    return Exllamav2HF(config)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\modules\exllamav2_hf.py", line 50, in __init__
    self.ex_model.load(split)
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\model.py", line 266, in load
    for item in f: x = item
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\model.py", line 284, in load_gen
    module.load()
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\attn.py", line 191, in load
    self.v_proj.load()
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\linear.py", line 45, in load
    if w is None: w = self.load_weight()
                      ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\module.py", line 111, in load_weight
    qtensors = self.load_multi(key, ["qweight", "qzeros", "scales", "g_idx", "bias"], override_key = override_key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\module.py", line 78, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Burrf\Desktop\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\fasttensors.py", line 118, in get_tensor
    return f.get_tensor(key)
           ^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
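
As the last lines of the log say, CUDA errors can be reported away from the call that actually failed. A minimal sketch of enabling the synchronous reporting the log suggests (the only assumption is that the variable is set before CUDA is first initialized in the process):

```python
# Sketch: surface CUDA errors at the failing call, per the log's suggestion.
# CUDA_LAUNCH_BLOCKING must be set before CUDA is first initialized, so it
# belongs at the very top of the entry script (or in the shell beforehand).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var so the setting takes effect
torch.cuda.init()  # initialize CUDA with blocking (synchronous) launches
```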

System Info

Windows 10
NVIDIA
GTX 980 Ti
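
For a "no kernel image" error, a useful first check is whether the GPU's compute capability appears in the architecture list the installed torch build (and, separately, the exllamav2 wheel) was compiled for. A minimal check, assuming it is run with the webui's own Python environment:

```python
# Compare the card's compute capability with the kernel images in this torch build.
import torch

print(torch.cuda.get_device_name(0))             # e.g. "NVIDIA GeForce GTX 980 Ti"
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")  # 980 Ti / Tesla M40 report sm_52
print(torch.cuda.get_arch_list())                # archs this torch build has kernels for
```

If sm_52 is missing from that list, or from the (typically newer-only) list the prebuilt exllamav2 extension targets, "no kernel image is available" is the expected failure.
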
Ph0rk0z commented 7 months ago

You would have to compile it yourself. ExLlama isn't compiled for Pascal, nor is performance on that card any good.
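
A minimal sketch of what building from source might look like, run with the webui's own Python (cmd_windows.bat opens that environment); the arch value and pip flags are assumptions, not tested on a Maxwell card, and a local CUDA toolkit plus a C++ compiler (MSVC on Windows) are required:

```python
# Hypothetical: rebuild exllamav2 from source, targeting compute capability 5.2.
import os
import subprocess
import sys

# TORCH_CUDA_ARCH_LIST is read by torch.utils.cpp_extension during the build;
# sm_52 covers the GTX 980 Ti and Tesla M40.
os.environ["TORCH_CUDA_ARCH_LIST"] = "5.2"

# --no-binary forces pip to build from the source distribution instead of
# installing a prebuilt wheel that lacks sm_52 kernel images.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--force-reinstall", "--no-binary", "exllamav2", "exllamav2",
])
```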

UlvenDagoth commented 7 months ago

So I'd be better off using a different model loader?

Ph0rk0z commented 7 months ago

Yep, pretty much.
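
For cards this old the usual fallback is a GGUF quant through the llama.cpp loader. Outside the webui, the same idea can be sketched directly with llama-cpp-python; the model path and layer count below are placeholders, not a tested configuration:

```python
# Hypothetical sketch: run a GGUF quant via llama-cpp-python instead of ExLlama.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model.Q4_K_M.gguf",  # placeholder: any GGUF quant
    n_gpu_layers=20,  # offload only what fits in the 980 Ti's 6 GB of VRAM
    n_ctx=2048,
)
out = llm("Q: What does 'no kernel image' mean?\nA:", max_tokens=48)
print(out["choices"][0]["text"])
```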

Fleischkuechle commented 7 months ago

I have the same issue running the model TheBloke_dolphin-2.6-mistral-7B-AWQ. I compared the conda virtual environment with my alltalk_tts one (pip list), and what I found is that there is no CUDA installed. I used the one-click installer (start_windows.bat). I also saw in the releases that there was an update to torch 2.2, but in my pip list it is at version 2.1. (Screenshots: pip list comparison between oobabooga and alltalk.)
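
A quick way to confirm whether the env's torch build actually has CUDA support (a minimal check, run with the webui's Python):

```python
# Sanity-check the torch build inside the webui's installer_files env.
import torch

print(torch.__version__)          # a "+cpu" suffix would mean a CPU-only build
print(torch.version.cuda)         # e.g. "12.1"; None for CPU-only builds
print(torch.cuda.is_available())  # False would match "no cuda installed"
```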

KnowhereFern commented 6 months ago

Having the same issue trying to run a Llama 3 70B model:

128 GB of RAM, 48 GB of VRAM

The error I get is:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

lugangqi commented 5 months ago

My M40 24GB fails in ExLlama the same way, while my 4060 Ti 16GB works fine under CUDA 12.4. It seems the author has not updated the kernels to stay compatible with the M40. I also asked for help in the ExLlamaV2 author's repo yesterday; I don't know whether he will fix this compatibility problem. The M40 has the same architecture as the 980 Ti (compute capability 5.2), which CUDA 12.4 still supports.

Touch-Night commented 5 months ago

> You would have to compile it yourself. ExLlama isn't compiled for Pascal, nor is performance on that card any good.

*Maxwell (the 980 Ti and M40 are Maxwell cards, not Pascal)

lugangqi commented 5 months ago

But I don't know how to compile it. Can you tell me how?