intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Apache License 2.0

Fails to load saved model : Trying to set a tensor of shape torch.Size([1376, 4096]) in "qweight" (which has shape torch.Size([4096, 1376])), this look incorrect. #1407

Open kranipa opened 7 months ago

kranipa commented 7 months ago

Loading the saved model runs into the following error. It also takes a very long time to run and save quantized models.

2024-03-21 08:48:58 [INFO] loading weights file models/4_bit_llama2-rtn/model.safetensors
2024-03-21 08:48:58 [ERROR] Trying to set a tensor of shape torch.Size([1376, 4096]) in "qweight" (which has shape torch.Size([4096, 1376])), this look incorrect.
2024-03-21 08:48:58 [ERROR] Saved low bit model loading failed, please check your model.

I tried the following example:

import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig, GPTQConfig, AwqConfig

model_path = "meta-llama/Llama-2-7b-chat-hf" # your_pytorch_model_path_or_HF_model_name
saved_dir = "models/4_bit_llama2-rtn" # your_saved_model_dir
#model_path  = "Intel/neural-chat-7b-v3-3" 
#saved_dir = "models/4_bit_neural_chat_7b-v3-3-rtn"
# quant
woq_config = RtnConfig(bits=4, compute_dtype="int8", scale_dtype='fp32', group_size=32)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                            device_map='cpu',
                                            torch_dtype=torch.float16,
                                            quantization_config=woq_config, 
                                            trust_remote_code=True,
                                            use_neural_speed=False)
# save quant model
model.save_pretrained(saved_dir)
# load quant model
loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir, trust_remote_code=True)
Package versions:

intel-extension-for-transformers==1.4rc2.dev8+g494a5712fa2
neural-compressor==2.4.1
neural-speed==0.4.dev21+g0ec1a6e
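
One way to confirm the shape mismatch reported above is to inspect the saved checkpoint directly. A minimal diagnostic sketch, assuming the file path from the log (models/4_bit_llama2-rtn/model.safetensors):

from safetensors import safe_open

# Print the serialized shape of every "qweight" tensor in the saved
# checkpoint; a [1376, 4096] vs. [4096, 1376] discrepancy would mean the
# weights were written transposed relative to what the loader expects.
with safe_open("models/4_bit_llama2-rtn/model.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        if "qweight" in name:
            print(name, f.get_slice(name).get_shape())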
intellinjun commented 7 months ago

In your call, model = AutoModelForCausalLM.from_pretrained(model_path, device_map='cpu', torch_dtype=torch.float16, quantization_config=woq_config, trust_remote_code=True, use_neural_speed=False), you set use_neural_speed=False. Do you want to use Neural Speed? If yes, try use_neural_speed=True.
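
A minimal sketch of that suggestion, assuming the same model_path and woq_config as in the original snippet; note that with use_neural_speed=True, from_pretrained returns a Neural Speed Model object rather than a PyTorch module (which is why save_pretrained is missing in the follow-up below):

# Variant of the original call with Neural Speed enabled; the returned
# object is a Neural Speed Model, not a PyTorch module.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                            quantization_config=woq_config,
                                            trust_remote_code=True,
                                            use_neural_speed=True)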

kranipa commented 7 months ago

Thank you for the response.

With use_neural_speed=True, the save function doesn't work.

I get the following error:

AttributeError: 'Model' object has no attribute 'save_pretrained'

Can you share an example of how to save a quantized model (a Model object) with neural_speed?

kevinintel commented 7 months ago

It looks like a load/save mismatch. Can you try the latest commit instead of g494a5712fa2 and set use_neural_speed=False?

kranipa commented 7 months ago

Hi, thank you. Saving works; however, loading the saved model leads to the following error:


    raise ValueError(
ValueError: Unknown quantization type, got rtn - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm']

Here is the code snippet:

import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig, GPTQConfig, AwqConfig

model_path = "meta-llama/Llama-2-7b-chat-hf" # your_pytorch_model_path_or_HF_model_name
saved_dir = "models/4_bit_llama2-rtn" # your_saved_model_dir
#model_path  = "Intel/neural-chat-7b-v3-3" 
#saved_dir = "models/4_bit_neural_chat_7b-v3-3-rtn"
# quant
woq_config = RtnConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                            device_map='cpu',
                                            #torch_dtype=torch.float16,
                                            quantization_config=woq_config, 
                                            trust_remote_code=True,
                                            use_neural_speed=False)
# save quant model
model.save_pretrained(saved_dir)
#load quant model
loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir, trust_remote_code=True)
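
The ValueError appears to come from the quantization-type dispatch on the saved config: the supported types listed in the message are the ones stock transformers knows, which suggests the checkpoint's quantization_config (with quant_method "rtn") is being parsed by the upstream loader rather than the ITREX one. A small sketch to inspect what was actually written, assuming the saved_dir above:

import json
import os

saved_dir = "models/4_bit_llama2-rtn"  # saved_dir from the snippet above

# Print the quantization_config block from the saved checkpoint's
# config.json; its "quant_method" field is what the loader dispatches on.
with open(os.path.join(saved_dir, "config.json")) as f:
    config = json.load(f)
print(config.get("quantization_config"))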
PenghuiCheng commented 7 months ago

@kranipa, this issue is caused by a version mismatch between ITREX and neural-compressor. You can use neural-compressor version 2.5.1 and try again. ITREX 1.4 is released now; please try it. Thanks very much.
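
A quick sketch to verify the environment matches that recommendation before retrying; the pip pins below assume the PyPI package names intel-extension-for-transformers and neural-compressor:

# pip install "intel-extension-for-transformers>=1.4" neural-compressor==2.5.1
from importlib.metadata import version

# Confirm the upgrade took effect; a leftover dev build of ITREX or an
# older neural-compressor would keep reproducing the load/save mismatch.
for pkg in ("intel-extension-for-transformers", "neural-compressor"):
    print(pkg, version(pkg))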

kranipa commented 6 months ago

Okay, thank you.

PhzCode commented 5 months ago

@kranipa Did you get it to run? I'm having the same problem.

PenghuiCheng commented 4 months ago

@PhzCode, could you post your code so I can try to reproduce it? Thanks very much.