Open h-sinha22 opened 3 months ago
Could you share a full reproducer?
Model config used:
{
  "_name_or_path": "/app/mnt/models_cache/bigcode/starcoder2-7b",
  "activation_function": "gelu",
  "architectures": ["Starcoder2ForCausalLM"],
  "attention_dropout": 0.1,
  "attention_softmax_in_fp32": true,
  "bos_token_id": 0,
  "embedding_dropout": 0.1,
  "eos_token_id": 0,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_size": 4608,
  "initializer_range": 0.018042,
  "intermediate_size": 18432,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 16384,
  "mlp_type": "default",
  "model_type": "starcoder2",
  "norm_epsilon": 1e-05,
  "norm_type": "layer_norm",
  "num_attention_heads": 36,
  "num_hidden_layers": 32,
  "num_key_value_heads": 4,
  "quantization_config": {
    "_load_in_4bit": false,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": false,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "residual_dropout": 0.1,
  "rope_theta": 1000000,
  "scale_attention_softmax_in_fp32": true,
  "scale_attn_weights": true,
  "sliding_window": 4096,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.1",
  "use_bias": true,
  "use_cache": true,
  "vocab_size": 49152
}
That is not a full reproducer; we need the full code that you are running.
I'm hitting the same error.
Package versions: bitsandbytes 0.43.1, transformers 4.40.0, torch 2.2.2+cu118, torchaudio 2.2.2+cu118, torchvision 0.17.2+cu118
My steps are as follows.
First, I quantize Chinese-Llama-2-7b into a 4-bit model. This is my quantization code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "LinkSoul/Chinese-Llama-2-7b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map='auto',
)

if __name__ == '__main__':
    import os

    output = "soulteary/Chinese-Llama-2-7b-4bit"
    if not os.path.exists(output):
        os.mkdir(output)
    model.save_pretrained(output)
    print("done")
**Then I get the quantized model soulteary/Chinese-Llama-2-7b-4bit, and I want to load it with transformers using the following code:**
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig

model_id = 'soulteary/Chinese-Llama-2-7b-4bit'

if torch.cuda.is_available():
    quantization_config = BitsAndBytesConfig(
        bnb_4bit_quant_type="bitsandbytes_4bit",
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        local_files_only=True,
        torch_dtype=torch.float16,
        device_map='auto',
    )
else:
    model = None
Then the error appears:
Traceback (most recent call last):
File "/home/soikit/LLM/app.py", line 6, in
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Similar issue... Not able to load after saving 4bit.
ValueError: Supplied state dict for model.layers.16.self_attn.vision_expert_dense.weight does not contain `bitsandbytes__*` and possibly other `quantized_stats` components.
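That error usually means the loader was told to expect a pre-quantized 4-bit state dict, but the weights on disk do not carry the serialized quantization state. A quick diagnostic sketch, assuming the checkpoint was saved as safetensors (the shard file name below is illustrative; adjust it to a file that actually exists in the checkpoint directory):

```python
from safetensors import safe_open

# Illustrative shard name; pick any *.safetensors file from the saved checkpoint directory.
path = "./cogvlm2-llama3-chat-19B-int4/model-00001-of-00008.safetensors"

with safe_open(path, framework="pt") as f:
    quant_keys = [k for k in f.keys() if "bitsandbytes" in k or "quant_state" in k]

# An empty list means the shard holds plain weights and the `bitsandbytes__*`
# quantized_stats entries mentioned in the error were never serialized.
print(quant_keys[:10])
```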
cc @SunMarc and @younesbelkada
Hi @1049451037, can you share a simple and short reproducible snippet? Can you also try with the latest transformers (`pip install -U transformers`)?
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    quant_method='nf4'
)
model = AutoModelForCausalLM.from_pretrained('THUDM/cogvlm2-llama3-chat-19B', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained('THUDM/cogvlm2-llama3-chat-19B')

# save int4
model.save_pretrained('./cogvlm2-llama3-chat-19B-int4')
tokenizer.save_pretrained('./cogvlm2-llama3-chat-19B-int4')

# load failed
model = AutoModelForCausalLM.from_pretrained('./cogvlm2-llama3-chat-19B-int4', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained('./cogvlm2-llama3-chat-19B-int4')
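One detail that stands out in the snippet: BitsAndBytesConfig does not take quant_method as the knob for the 4-bit data type; the parameter for that is bnb_4bit_quant_type. A corrected config sketch (this is an assumption about what was intended, not a confirmed fix for the cogvlm2 failure):

```python
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the 4-bit data type: "nf4" or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # optional: run matmuls in bf16
)
```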
On it!
cc @SunMarc
System Info
I am running on an A100 with 40 GB of GPU memory.
Who can help?
@SunMarc and @younesbelkada
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. I have an SFT-tuned starcoder2 model.
2. I am trying to load it using AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path) (a sketch follows below).
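A hedged sketch of what those two steps look like in code; the checkpoint path is a placeholder, and it assumes the SFT-tuned model was saved with the quantization_config shown in the config posted above:

```python
from transformers import AutoModelForCausalLM

# Placeholder path to the SFT-tuned starcoder2 checkpoint; adjust to the real directory.
pretrained_model_name_or_path = "/app/mnt/models_cache/starcoder2-7b-sft"

# Load the fine-tuned checkpoint; the quantization_config embedded in its
# config.json (see the config posted above) is picked up automatically.
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    device_map="auto",
)
```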
Expected behavior
It should be able to load the model properly.