turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Quantization of glm4-9b failed #489

Open Orion-zhen opened 3 weeks ago

Orion-zhen commented 3 weeks ago

When I run convert.py with the command CUDA_VISIBLE_DEVICES=1 python convert.py -i /home/orion/ai/Models/glm4-9b -o ./tmp-file -cf /home/orion/ai/Models/glm4-9b-4-exl2 -r 256, it fails with: TypeError: Value for eos_token_id is not of expected type <class 'int'>.

It seems that the glm4 architecture isn't supported yet.

Steps to reproduce: download the glm4-9b model and run convert.py as the README describes.

Full console log:

 !! Warning, unknown architecture: ChatGLMModel
 !! Loading as LlamaForCausalLM
Traceback (most recent call last):
  File "/home/orion/repo/exllamav2/convert.py", line 71, in <module>
    config.prepare()
  File "/home/orion/repo/exllamav2/exllamav2/config.py", line 187, in prepare
    self.eos_token_id = read(read_config, int, "eos_token_id", None)  # 2
  File "/home/orion/repo/exllamav2/exllamav2/config.py", line 40, in read
    raise TypeError(f"Value for {key} is not of expected type {expected_type}")
TypeError: Value for eos_token_id is not of expected type <class 'int'>
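
For reference, the failure traces back to the model's config.json, where eos_token_id is a list rather than a single int. A minimal check (path taken from the command above):

    import json

    # convert.py expects an int here, but the glm4 config provides a list
    with open("/home/orion/ai/Models/glm4-9b/config.json") as f:
        cfg = json.load(f)

    print(type(cfg["eos_token_id"]))  # <class 'list'>, not <class 'int'>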
turboderp commented 3 weeks ago

Yeah, the architecture isn't supported. There's a bunch of little things that would have to be updated, like how the EOS token is a list all of a sudden, scaled attention layers and such. It's not high on the list of priorities at the moment. Not sure if the model is any good, or if it's any good without the multimodal capabilities which wouldn't be supported anyway.
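
For anyone who wants to take a stab at it, here is a sketch of one way the reader in config.py could tolerate both forms. The signature mirrors the call in the traceback above; the eventual fix may well look different:

    # Sketch only, not the actual exllamav2 implementation.
    def read(read_config, expected_type, key, default=None):
        if key not in read_config:
            return default
        value = read_config[key]
        # Newer configs (e.g. glm4) store eos_token_id as a list of token IDs
        if isinstance(value, list) and all(isinstance(v, expected_type) for v in value):
            return value
        if isinstance(value, expected_type):
            return value
        raise TypeError(f"Value for {key} is not of expected type {expected_type}")

Anything downstream that assumes a single EOS token would then also need to handle the list case, e.g. by treating every ID in the list as a stop token.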

Orion-zhen commented 3 weeks ago

According to THUDM, glm4-9b can outperform llama3-8b, so it might be worth a try. BTW, would it be possible to add multimodal support to exllamav2 in the future? Multimodal LLMs (VLMs) look like they could be the next trend.

turboderp commented 2 weeks ago

Multimodal is possible, of course, as is GLM4 in general, along with diffusion models, TTS, you name it. I just have to prioritize. But contributions are always welcome.