mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

TypeError when loading from_pretrained #47

Closed ghost closed 5 months ago

ghost commented 5 months ago

Hi, I met the following error when I tried to load a llama model:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:35<00:00,  1.66it/s]
Traceback (most recent call last):
  File "/workspace/code/quant.py", line 11, in <module>
    model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype,trust_remote_code=True)
  File "/usr/local/lib/python3.10/dist-packages/hqq/engine/hf.py", line 67, in from_pretrained
    cls._make_quantizable(model, quantized=False)
  File "/usr/local/lib/python3.10/dist-packages/hqq/engine/hf.py", line 35, in _make_quantizable
    model.arch_key = model.config.architectures[0]
TypeError: 'NoneType' object is not subscriptable

I am using PyTorch==2.2.0 and Transformers==4.39.0. How can I solve this problem? Looking forward to your reply.

mobicham commented 5 months ago

Can you post the code that produced this? My guess is that you are trying to load an unsupported architecture. Try with AutoMode: https://github.com/mobiusml/hqq/?tab=readme-ov-file#auto-mode-1
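
For reference, a minimal Auto-mode sketch along the lines of the linked README section; the model path, dtype, and device below are placeholders taken from this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

model_id      = '/workspace/model/llama-my'   # placeholder: your model path
compute_dtype = torch.float16

#Load the original fp16 model with plain transformers (no architecture whitelist involved)
model     = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

#Quantize in place via the architecture-agnostic wrapper
quant_config = BaseQuantizeConfig(nbits=4, group_size=128)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device='cuda:1')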

ghost commented 5 months ago

I was just following the basic usage code and modified the model_id to point to my model path (a Llama model I trained myself):

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
import torch
#Model and settings
model_id      = '/workspace/model/llama-my'
compute_dtype = torch.float16
device        = 'cuda:1'
save_dir      = './quantized_model'

#Load model on the CPU
######################
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype,trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id,trust_remote_code=True) 

#Quantize the model
######################
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=128)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device) 

#Save the quantized model
model.save_quantized(model, save_dir=save_dir)

#Load from local directory or Hugging Face Hub on a specific device
model = HQQModelForCausalLM.from_quantized(save_dir, device='cuda')

ghost commented 5 months ago

> Can you post the code that produced this? My guess is that you are trying to load an unsupported architecture. Try with AutoMode: https://github.com/mobiusml/hqq/?tab=readme-ov-file#auto-mode-1

It works with AutoMode. Thanks.

mobicham commented 5 months ago

Looks like you were not loading a LlamaForCausalLM model. Only the architectures listed here are supported by HQQModelForCausalLM: https://github.com/mobiusml/hqq/blob/master/hqq/engine/hf.py#L14-L17
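
A quick way to check which loader applies is to look at what the config actually reports; a small sketch (AutoConfig is standard transformers, the path is a placeholder):

from transformers import AutoConfig

config = AutoConfig.from_pretrained('/workspace/model/llama-my', trust_remote_code=True)  # placeholder path
print(config.architectures)
#['LlamaForCausalLM'] (or another supported class) -> HQQModelForCausalLM works
#None -> fall back to AutoHQQHFModel, or set the field explicitly (see below)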

ghost commented 5 months ago

Hi, when I load the quantized model using AutoHQQHFModel.from_quantized, I get the following error:

model = AutoHQQHFModel.from_quantized('/workspace/quantized_model')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/hqq/models/base.py", line 315, in from_quantized
    model = cls.create_model(save_dir)
  File "/usr/local/lib/python3.10/dist-packages/hqq/models/hf/base.py", line 23, in create_model
    if len(archs) == 1 and ("CausalLM" in archs[0]):
TypeError: object of type 'NoneType' has no len()

The config.json under the quantized_model path is:

{
  "_name_or_path": "/workspace/model/llama-my",
  "attention_bias": true,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 6656,
  "initializer_range": 0.02,
  "intermediate_size": 22272,
  "max_position_embeddings": 32768,
  "model_type": "llama",
  "num_attention_heads": 52,
  "num_hidden_layers": 58,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_dim": 64,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.39.0.dev0",
  "use_cache": true,
  "vocab_size": 65024
}

mobicham commented 5 months ago

Seems like you are using a custom non-HF architecture. AutoHQQHFModel is designed to work with officially supported Hugging Face models (hence the HF in the name).

Can you post the architecture here (the output of print(model))? Then I can tell you how to do it manually.

mobicham commented 5 months ago

If it's a Hugging Face Llama model, try adding this part in your config and it should work:

model.config.architectures = ['LlamaForCausalLM']
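
A sketch of where that line would go in the Auto-mode flow from this thread, so the field ends up in the saved config.json (this assumes save_quantized writes model.config out along with the weights):

#After loading the fp16 model, before quantizing/saving:
model.config.architectures = ['LlamaForCausalLM']

AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype, device=device)
AutoHQQHFModel.save_quantized(model, save_dir)

#The saved config.json should now contain "architectures",
#so AutoHQQHFModel.from_quantized(save_dir) can resolve the model class.
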
ghost commented 5 months ago

Here is the model architecture:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(65024, 6656)
    (layers): ModuleList(
      (0-57): 58 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=6656, out_features=6656, bias=True)
          (k_proj): Linear(in_features=6656, out_features=512, bias=True)
          (v_proj): Linear(in_features=6656, out_features=512, bias=True)
          (o_proj): Linear(in_features=6656, out_features=6656, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=6656, out_features=22272, bias=False)
          (up_proj): Linear(in_features=6656, out_features=22272, bias=False)
          (down_proj): Linear(in_features=22272, out_features=6656, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=6656, out_features=65024, bias=False)
)

I load the fp16 model and quantize it successfully, but when I try to load the quantized model, the above error occurs.

#Imports needed for this snippet (module paths as used elsewhere in this thread)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel
#model_path, compute_dtype, device and save_dir are set as in the first snippet

#Load model on the CPU
######################
model = AutoModelForCausalLM.from_pretrained(
            model_path, low_cpu_mem_usage=True, torch_dtype=compute_dtype, trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_path,trust_remote_code=True) 

#Quantize the model
#####################
quant_config = BaseQuantizeConfig(nbits=4, group_size=128)

AutoHQQHFModel.quantize_model(model, quant_config=quant_config, 
                                    compute_dtype=compute_dtype, 
                                    device=device)

#Save the quantized model
AutoHQQHFModel.save_quantized(model, save_dir)

mobicham commented 5 months ago

Did you try:

model.config.architectures = ['LlamaForCausalLM']

Otherwise, put it in the config json file: https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/config.json#L3-L5

What happens is that, at load time, it doesn't know what kind of architecture the model is, which is why it breaks; the trick above should fix that. The other option would be to use something like the following (this would only work properly if your model uses the same logic as HF's Llama model definition):

from hqq.models.hf.llama import LlamaHQQ

model = LlamaHQQ.from_pretrained(....)
LlamaHQQ.save_quantized(...)
model = LlamaHQQ.from_quantized(...)
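
If the model has already been quantized and saved, another option in the spirit of the config.json suggestion above is to patch the saved config directly; a minimal sketch using the save path mentioned earlier in the thread:

import json, os

save_dir    = '/workspace/quantized_model'       # placeholder: the quantized save dir
config_path = os.path.join(save_dir, 'config.json')

with open(config_path) as f:
    config = json.load(f)

config['architectures'] = ['LlamaForCausalLM']   # add the field the loader looks for

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

#After this, AutoHQQHFModel.from_quantized(save_dir) should be able to resolve the architecture.
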
ghost commented 5 months ago

Got it. model.config.architectures = ['LlamaForCausalLM'] works~ Thanks!