turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Can load GPTQ models fine, but when running inference I get the following traceback #249

Closed userbox020 closed 2 months ago

userbox020 commented 9 months ago

All other loaders load and run inference fine.

2023-12-27 22:03:03 INFO:Loading TheBloke_Nous-Hermes-13B-GPTQ...
Successfully preprocessed all matching files.
2023-12-27 22:03:04 WARNING:You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.
Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features
2023-12-27 22:03:05 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/modules/ui_model_menu.py", line 209, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/modules/models.py", line 88, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/modules/models.py", line 398, in ExLlamav2_loader
    model, tokenizer = Exllamav2Model.from_pretrained(model_name)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/modules/exllamav2.py", line 58, in from_pretrained
    model.load(split)
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 239, in load
    for item in f: return item
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 258, in load_gen
    module.load()
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 78, in load
    self.input_layernorm.load()
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/rmsnorm.py", line 23, in load
    w = self.load_weight()
        ^^^^^^^^^^^^^^^^^^
  File "/media/10TB_HHD/_OOBAGOOBA-AMD_V2/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/module.py", line 94, in load_weight
    tensor = tensor.half()
             ^^^^^^^^^^^^^
RuntimeError: HIP error: the operation cannot be performed in the present state
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
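
A minimal diagnostic sketch (not from the thread), assuming a ROCm build of PyTorch: the HIP error above is raised by the first tensor.half() call that touches the GPU, so checking the device state and the same cast in isolation can narrow down whether the problem is ExLlamaV2 or the ROCm runtime itself.

import torch

print("torch version:", torch.__version__)
print("HIP/ROCm version:", torch.version.hip)          # None on CUDA-only builds
print("device available:", torch.cuda.is_available())   # ROCm builds reuse the cuda namespace

if torch.cuda.is_available():
    x = torch.ones(4, device="cuda")
    print(x.half())  # same on-device half() cast that fails in module.py

If this snippet fails with the same HIP error, the issue is with the PyTorch/ROCm installation or driver state rather than the loader.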
turboderp commented 2 months ago

I'm closing this as stale, since there have been many updates to both TGW and ExLlama in the meantime. Please reopen if the issue remains.