model = LlamaForCausalLM.from_pretrained(path, **args)
3. A warning message is printed and GPU memory use is unexpectedly high (9 GB+, while about 2.4 GB is expected):
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
4. Using pdb, you can see that model.layers.11 has not been loaded in 8-bit, while model.layers.0 has (see the fuller per-layer check sketched after this output):
(Pdb) p model.model.layers[11].self_attn.q_proj.weight.dtype
torch.float32
(Pdb) p model.model.layers[0].self_attn.q_proj.weight.dtype
torch.int8
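A fuller version of the same check, looping over every decoder layer, makes the split obvious. This is a minimal sketch assuming the model object loaded above; bitsandbytes normally replaces quantized linear layers with its Linear8bitLt module, so the module class is as telling as the weight dtype:

import torch

for i, layer in enumerate(model.model.layers):
    q_proj = layer.self_attn.q_proj
    # expected: Linear8bitLt / torch.int8 for the cuda:0 layers,
    # plain Linear / torch.float32 for the offloaded CPU layers
    print(i, type(q_proj).__name__, q_proj.weight.dtype, q_proj.weight.device)

# rough view of how much memory actually landed on the GPU (bytes -> GiB)
print(f"{torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB allocated on cuda:0")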
Expected behavior
I hope that when I set both device_map and quantization_config, the quantized model is loaded correctly, even if only the middle layers are mapped to the GPU. In particular, in my example I want model.layers.11 through model.layers.22 to be loaded in int8.
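As a concrete form of that expectation, here is a hypothetical check (not code from the report): after loading, every layer that the device_map above places on cuda:0 should hold int8 weights.

import torch

for i in range(11, 23):  # layers 11-22 are mapped to cuda:0
    w = model.model.layers[i].self_attn.q_proj.weight
    assert w.dtype == torch.int8, f"layer {i} was not quantized: {w.dtype}"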
System Info
platform: nvidia/cuda:12.1.0-devel-ubuntu22.04
python: 3.10.3
transformers: 4.38.2
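A small sketch for collecting the same details programmatically (the transformers-cli env command prints a fuller report; bitsandbytes is included here because the 8-bit path depends on it):

import torch
import transformers
import bitsandbytes

print("transformers", transformers.__version__)
print("torch", torch.__version__, "CUDA", torch.version.cuda)
print("bitsandbytes", bitsandbytes.__version__)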
Who can help?
@ArthurZucker and @younesbelkada
Reproduction
import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"
args = {
    # layers 11-22 on the GPU, everything else offloaded to the CPU
    "device_map": {
        "model.embed_tokens": "cpu",
        "model.layers.0": "cpu", "model.layers.1": "cpu", "model.layers.2": "cpu",
        "model.layers.3": "cpu", "model.layers.4": "cpu", "model.layers.5": "cpu",
        "model.layers.6": "cpu", "model.layers.7": "cpu", "model.layers.8": "cpu",
        "model.layers.9": "cpu", "model.layers.10": "cpu",
        "model.layers.11": "cuda:0", "model.layers.12": "cuda:0", "model.layers.13": "cuda:0",
        "model.layers.14": "cuda:0", "model.layers.15": "cuda:0", "model.layers.16": "cuda:0",
        "model.layers.17": "cuda:0", "model.layers.18": "cuda:0", "model.layers.19": "cuda:0",
        "model.layers.20": "cuda:0", "model.layers.21": "cuda:0", "model.layers.22": "cuda:0",
        "model.layers.23": "cpu", "model.layers.24": "cpu", "model.layers.25": "cpu",
        "model.layers.26": "cpu", "model.layers.27": "cpu", "model.layers.28": "cpu",
        "model.layers.29": "cpu", "model.layers.30": "cpu", "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}
model = LlamaForCausalLM.from_pretrained(path, **args)
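A hedged follow-up check (not part of the original report) that inspects how the model was actually dispatched and quantized once from_pretrained returns, using the hf_device_map attribute and get_memory_footprint method that transformers attaches to models loaded with a device_map:

print(model.hf_device_map)                        # per-module placement actually used
print(model.get_memory_footprint() / 1024**3)     # total model footprint in GiB
print(type(model.model.layers[11].self_attn.q_proj).__name__)  # should be Linear8bitLt, not Linear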