turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Help me, I get a CUDA error. #525

Open lugangqi opened 3 months ago

lugangqi commented 3 months ago

01:54:47-255686 INFO Starting Text generation web UI
01:54:47-260684 WARNING trust_remote_code is enabled. This is dangerous.
01:54:47-268684 INFO Loading the extension "openai"
01:54:47-469684 INFO OpenAI-compatible API URL:

                     http://127.0.0.1:5000

Running on local URL: http://127.0.0.1:7860

01:55:11-441029 INFO Loading "14b-exl"
01:55:12-675028 ERROR Failed to load the model.
Traceback (most recent call last):
  File "D:\text-generation-webui\modules\ui_model_menu.py", line 249, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "D:\text-generation-webui\modules\models.py", line 94, in load_model
    output = load_func_map[loader](model_name)
  File "D:\text-generation-webui\modules\models.py", line 366, in ExLlamav2_loader
    from modules.exllamav2 import Exllamav2Model
  File "D:\text-generation-webui\modules\exllamav2.py", line 5, in <module>
    from exllamav2 import (
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\__init__.py", line 3, in <module>
    from exllamav2.model import ExLlamaV2
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 25, in <module>
    from exllamav2.linear import ExLlamaV2Linear
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\linear.py", line 7, in <module>
    from exllamav2.module import ExLlamaV2Module
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 14, in <module>
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
NameError: name 'os' is not defined
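The first failure is an ordinary Python error rather than a CUDA problem: exllamav2's module.py references os without importing it. A minimal sketch of the likely fix, assuming the installed copy of module.py is only missing the import:

# exllamav2/module.py (sketch; assumes the only problem is the missing import)
import os

# Synchronous kernel launches make CUDA errors surface at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"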

01:55:54-858096 INFO Loading "14b-exl"
01:55:56-017617 ERROR Failed to load the model.
Traceback (most recent call last):
  File "D:\text-generation-webui\modules\ui_model_menu.py", line 249, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "D:\text-generation-webui\modules\models.py", line 94, in load_model
    output = load_func_map[loader](model_name)
  File "D:\text-generation-webui\modules\models.py", line 368, in ExLlamav2_loader
    model, tokenizer = Exllamav2Model.from_pretrained(model_name)
  File "D:\text-generation-webui\modules\exllamav2.py", line 60, in from_pretrained
    model.load(split)
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 333, in load
    for item in f: x = item
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 356, in load_gen
    module.load()
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\attn.py", line 255, in load
    self.k_proj.load()
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\linear.py", line 92, in load
    if w is None: w = self.load_weight()
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 110, in load_weight
    qtensors = self.load_multi(key, ["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm", "bias"])
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 90, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
  File "D:\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\fasttensors.py", line 204, in get_tensor
    tensor = f.get_tensor(key)
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

My hardware: CPU Xeon E5-2666 v3, 32 GB DDR3 ECC memory at 1866 MHz, GPUs RTX 4060 Ti 16 GB and Tesla M40 24 GB.
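For reference, "no kernel image is available for execution on the device" usually means the compiled kernels were not built for one of the installed GPUs' compute capabilities. A quick diagnostic sketch (not part of the original report) to compare the GPUs against the architectures the installed PyTorch build targets:

import torch

# Print each GPU and its compute capability, e.g. Tesla M40 -> (5, 2), RTX 4060 Ti -> (8, 9).
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))

# Architectures the installed PyTorch build ships kernels for, e.g. ['sm_80', 'sm_86', ...].
print(torch.cuda.get_arch_list())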

I think I found how to force it to support a compute capability 5.2 GPU: cc_flag.append("-gencode") and cc_flag.append("arch=compute_50,code=sm_50"), but I don't know where to add these lines. I hope the developer sees this and helps me solve the problem.
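Flags like these are normally passed to nvcc in the extension build, usually from a project's setup.py; for compute capability 5.2 the pair would be arch=compute_52,code=sm_52. A generic, hypothetical sketch of where such flags typically go when building a PyTorch CUDA extension (this is not exllamav2's actual setup.py; file and extension names are illustrative):

from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

cc_flag = []
cc_flag.append("-gencode")
cc_flag.append("arch=compute_52,code=sm_52")   # Maxwell (Tesla M40)
cc_flag.append("-gencode")
cc_flag.append("arch=compute_89,code=sm_89")   # Ada (RTX 4060 Ti)

setup(
    name="example_ext",
    ext_modules=[
        CUDAExtension(
            name="example_ext_c",
            sources=["ext.cpp", "ext_kernels.cu"],
            extra_compile_args={"cxx": [], "nvcc": cc_flag},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)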

lugangqi commented 3 months ago

The M40 has compute capability 5.2 and the 4060 Ti has 8.9. ExLlamaV2 does not support devices with compute capability 5.2; I'm asking the developers to add support.
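If someone wanted to try anyway, the usual knob for a from-source build is TORCH_CUDA_ARCH_LIST, which torch.utils.cpp_extension reads at build time; whether exllamav2's kernels would actually compile and run on compute capability 5.2 is an open question. A sketch:

# Sketch: select target architectures before building a torch CUDA extension
# from source. This does not guarantee the kernels work on Maxwell-class GPUs.
import os
os.environ["TORCH_CUDA_ARCH_LIST"] = "5.2;8.9"
# ...then run the normal source install (e.g. pip install . inside the repo).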

lugangqi commented 3 months ago

Developers, please update the kernel image to support the M40 graphics card. I really need this. I'm not very good with GitHub, so if anyone sees this, please help me contact the developer. Thank you.

lugangqi commented 3 months ago

Please update the kernel image for the M40 graphics card. I really need it. I don't know how to use GitHub well, so if you see this, please help me contact the developer. Thank you.

lugangqi commented 3 months ago

Please help me with the question above. Thank you very much; it is very important to me.

lugangqi commented 3 months ago

@turboderp

DocShotgun commented 3 months ago

The M40 is really too old to be usable. ExLlamaV2 is already especially slow on older GPUs like the RTX 10XX series or the P40, and the M40 is even older than those.

If you want to use ExLlamaV2, it's best to stick to RTX 30XX and RTX 40XX class GPUs.

lugangqi commented 3 months ago

Can't it be made compatible? Even if it's slower, it would still be faster than llama.cpp...

lugangqi commented 3 months ago

Can't it be made compatible? Even if it's slow, it would still be faster than llama.cpp...

DocShotgun commented 3 months ago

Just use llama.cpp/GGUF if you want to use the M40.

dnhkng commented 3 months ago

I agree. turboderp is doing a great job making LLMs go brrrrrrrrrrrrrrrrrrrrrrr

If he starts supporting old hardware, LLMs will only go brrrrrrr

remichu-ai commented 3 months ago

Exllama doesn't have as big a developer group as llama.cpp, and the focus has always been on speed. By definition, focusing on speed means more optimization and less compatibility compared to llama.cpp, which supports all types of models and hardware.

For older hardware, the gain in speed might not be worth it, so it is better to just use llama.cpp and let exllama focus its development effort on supporting newer models, e.g. Gemma 2.