turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Cannot load Gemma model with latest version #350

techvd closed this issue 5 months ago

techvd commented 7 months ago

I'm using the latest version (exllamav2-0.0.13.post2+cu121-cp311-cp311-win_amd64.whl) on Windows and cannot seem to load Gemma 7B models. I tried both the 6bpw and 4bpw versions from https://huggingface.co/turboderp/Gemma-7B-it-exl2/tree/main. Is there anything that needs to change or be updated? This is custom Python code using my own scripts.

    self.model.load_autosplit(self.cache)
  File "C:\Users\techv\miniconda3\envs\pytorch\Lib\site-packages\exllamav2\model.py", line 292, in load_autosplit
    for item in f: x = item
  File "C:\Users\techv\miniconda3\envs\pytorch\Lib\site-packages\exllamav2\model.py", line 377, in load_autosplit_gen
    module.load()
  File "C:\Users\techv\miniconda3\envs\pytorch\Lib\site-packages\exllamav2\attn.py", line 189, in load
    self.q_proj.load()
  File "C:\Users\techv\miniconda3\envs\pytorch\Lib\site-packages\exllamav2\linear.py", line 51, in load
    self.q_handle = ext.make_q_matrix(w, self.temp_dq)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\techv\miniconda3\envs\pytorch\Lib\site-packages\exllamav2\ext.py", line 209, in make_q_matrix
    return ext_c.make_q_matrix(w["q_weight"],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Insufficient size of temp_dq buffer
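
For context, the failing call at the bottom of the traceback (model.load_autosplit(cache)) is the last step of the usual exllamav2 loading sequence. A minimal sketch of that sequence, assuming the class names from the library's public examples and a placeholder model path:

    # Minimal sketch of the loading path the traceback goes through.
    # The model directory is a placeholder, not the reporter's actual path.
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

    config = ExLlamaV2Config()
    config.model_dir = "models/Gemma-7B-it-exl2-4.0bpw"  # placeholder path
    config.prepare()                                     # reads config.json and sets up the model layout

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)             # lazy cache, allocated during autosplit
    model.load_autosplit(cache)                          # the call that raises in the traceback

    tokenizer = ExLlamaV2Tokenizer(config)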
turboderp commented 7 months ago

There isn't a prebuilt wheel yet that supports Gemma; you'd have to build from source. I'm bumping the version soon, though, probably within the next few hours. Just dotting some Ts, so be patient. :)
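
After rebuilding, a quick way to confirm which exllamav2 build the environment is actually picking up (this uses only the standard library's package metadata, nothing exllamav2-specific):

    # Check the installed exllamav2 version; the wheel named in the report
    # (0.0.13.post2) predates Gemma support.
    from importlib.metadata import version
    print(version("exllamav2"))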

turboderp commented 7 months ago

(Latest release should support Gemma now.)

codenoid commented 7 months ago

Sorry, is something wrong here? I'm using the latest master branch with turboderp/Gemma-7B-exl2 at 4.0bpw:

[screenshot: model output]

turboderp commented 7 months ago

It's hard to say, since that isn't the instruct model, so results using the instruct prompt template are going to be unpredictable. I haven't heard of anyone getting really good results with either version, though, whether in ExLlama (with or without quantization), gemma.cpp or Transformers. It seems to be especially bad at back-and-forth conversation. Here's an example from the instruct version:

[screenshot: example chat with the instruct version]

In any case, I would suggest trying the instruct version first.
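
For reference, Gemma's published chat format wraps each turn in <start_of_turn>/<end_of_turn> markers. A minimal sketch of prompting the instruct model that way with exllamav2's simple generator, continuing from a model, cache and tokenizer loaded as in the earlier sketch; the sampler values are illustrative, not recommendations:

    # Sketch: prompting the *instruct* model with Gemma's chat template.
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8   # illustrative values only
    settings.top_p = 0.9

    # Each turn is wrapped in turn markers; the prompt ends where the model's turn begins.
    prompt = (
        "<start_of_turn>user\n"
        "Write a haiku about GPUs.<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

    # Depending on the exllamav2 version, passing encode_special_tokens=True may be
    # needed so the turn markers are tokenized as Gemma's special tokens.
    output = generator.generate_simple(prompt, settings, 200)
    print(output)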

codenoid commented 7 months ago

My bad, I wasn't using the instruct version. It's good now:

[screenshot: output from the instruct version]

turboderp commented 7 months ago

I don't know if I'd call that good, but it seems to be working, at least. The instruct finetune seems rather weak, and for tasks like that you'd probably get better results with a few-shot prompt for the base model.
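
As a sketch of what a few-shot prompt against the base model could look like, reusing the generator and settings from the previous sketch (the example content is purely illustrative):

    # Few-shot prompting the base (non-instruct) model: no chat template,
    # just a pattern of examples for the model to continue.
    few_shot = (
        "Country: France\nCapital: Paris\n\n"
        "Country: Japan\nCapital: Tokyo\n\n"
        "Country: Canada\nCapital:"
    )

    output = generator.generate_simple(few_shot, settings, 8)
    print(output)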