Can you post the code that produced this? My guess is that you are trying to load an unsupported architecture. Try with AutoMode: https://github.com/mobiusml/hqq/?tab=readme-ov-file#auto-mode-1
I just followed the basic usage code and changed model_id to my model path (a Llama model I trained myself):
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
import torch
#Model and settings
model_id = '/workspace/model/llama-my'
compute_dtype = torch.float16
device = 'cuda:1'
save_dir = './quantized_model'
#Load model on the CPU
######################
model = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype,trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id,trust_remote_code=True)
#Quantize the model
######################
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=128)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device)
#Save the quantized model
model.save_quantized(model, save_dir=save_dir)
#Load from local directory or Hugging Face Hub on a specific device
model = HQQModelForCausalLM.from_quantized(save_dir, device='cuda')
It works with AutoMode. Thanks.
Looks like you were not loading a LlamaForCausalLM model. Only the architectures listed here are supported by HQQModelForCausalLM: https://github.com/mobiusml/hqq/blob/master/hqq/engine/hf.py#L14-L17
Hi, when I load the quantized model using AutoHQQHFModel.from_quantized, I get the following error:
model = AutoHQQHFModel.from_quantized('/workspace/quantized_model')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/hqq/models/base.py", line 315, in from_quantized
model = cls.create_model(save_dir)
File "/usr/local/lib/python3.10/dist-packages/hqq/models/hf/base.py", line 23, in create_model
if len(archs) == 1 and ("CausalLM" in archs[0]):
TypeError: object of type 'NoneType' has no len()
The config.json under the quantized_model path is:
{
"_name_or_path": "/workspace/model/llama-my",
"attention_bias": true,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 6656,
"initializer_range": 0.02,
"intermediate_size": 22272,
"max_position_embeddings": 32768,
"model_type": "llama",
"num_attention_heads": 52,
"num_hidden_layers": 58,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_dim": 64,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.39.0.dev0",
"use_cache": true,
"vocab_size": 65024
}
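The config.json above has no "architectures" entry, which would explain the traceback: the check in create_model calls len() on that field. Below is a minimal sketch of the failure, assuming the loader ends up with None when the key is absent. The helper name looks_like_causal_lm and the trimmed config are made up for illustration; only the condition itself is copied from the traceback.

```python
import json

# Trimmed copy of the config.json above; note there is no "architectures" key.
saved_config = json.loads('{"model_type": "llama", "torch_dtype": "float16"}')

# Assumption: the loader reads the field roughly like this, so a missing
# key comes back as None.
archs = saved_config.get("architectures")


def looks_like_causal_lm(archs):
    # Same condition as in the traceback (hqq/models/hf/base.py, line 23).
    return len(archs) == 1 and ("CausalLM" in archs[0])


try:
    looks_like_causal_lm(archs)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)  # object of type 'NoneType' has no len()

# With the field present, the same check passes.
fixed = looks_like_causal_lm(["LlamaForCausalLM"])
```

So the quantized weights themselves are probably fine; the saved config is just missing the one field the loader uses to pick an architecture.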
Seems like you are using a custom non-HF architecture. AutoHQQHFModel is designed to work with officially supported Hugging Face models (hence the HF in the name).
Can you post the architecture here (the output of print(model))? Then I can tell you how to do it manually.
If it's a Hugging Face Llama model, try adding this part in your config and it should work:
model.config.architectures = ['LlamaForCausalLM']
Here is the model architecture:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(65024, 6656)
(layers): ModuleList(
(0-57): 58 x LlamaDecoderLayer(
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=6656, out_features=6656, bias=True)
(k_proj): Linear(in_features=6656, out_features=512, bias=True)
(v_proj): Linear(in_features=6656, out_features=512, bias=True)
(o_proj): Linear(in_features=6656, out_features=6656, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=6656, out_features=22272, bias=False)
(up_proj): Linear(in_features=6656, out_features=22272, bias=False)
(down_proj): Linear(in_features=22272, out_features=6656, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=6656, out_features=65024, bias=False)
)
I loaded the fp16 model and quantized it successfully. But when I try to load the quantized model, the above error occurs.
#Load model on the CPU
######################
model = AutoModelForCausalLM.from_pretrained(
model_path, low_cpu_mem_usage=True, torch_dtype=compute_dtype, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path,trust_remote_code=True)
#Quantize the model
#####################
quant_config = BaseQuantizeConfig(nbits=4, group_size=128)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
compute_dtype=compute_dtype,
device=device)
#Save the quantized model
AutoHQQHFModel.save_quantized(model, save_dir)
Did you try this?
model.config.architectures = ['LlamaForCausalLM']
Otherwise, put it in the config json file: https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/config.json#L3-L5
The loader doesn't know what kind of architecture it is at load time, which is why it breaks, so the trick above should work. The other option is to use the Llama-specific class directly, which will only work properly if your model follows the same logic as HF's Llama model definition:
from hqq.models.hf.llama import LlamaHQQ
model = LlamaHQQ.from_pretrained(....)
LlamaHQQ.save_quantized(...)
model = LlamaHQQ.from_quantized(...)
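For a model that has already been saved, the config-file route can be sketched as follows. This is a minimal sketch using only the standard library; add_architectures is a hypothetical helper, not part of hqq, and it assumes the loader reads config.json from the saved directory.

```python
import json
from pathlib import Path


def add_architectures(save_dir, architectures=("LlamaForCausalLM",)):
    """Add an "architectures" entry to save_dir/config.json if it is missing."""
    config_path = Path(save_dir) / "config.json"
    config = json.loads(config_path.read_text())

    # Only write when the field is absent or empty, so an existing
    # value is never overwritten.
    if not config.get("architectures"):
        config["architectures"] = list(architectures)
        config_path.write_text(json.dumps(config, indent=2) + "\n")

    return config["architectures"]
```

Calling add_architectures('/workspace/quantized_model') once and then retrying AutoHQQHFModel.from_quantized should have the same effect as setting model.config.architectures before saving, provided the model really matches HF's Llama definition.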
Got it. Setting model.config.architectures = ['LlamaForCausalLM'] works. Thanks!
Hi, I got the following error when I tried to load a Llama model:
I use PyTorch==2.2.0 and Transformers==4.39.0. How can I solve this problem? Looking forward to your reply.