huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

load_in_8bit doesn't work when device_map is set #29691

Closed Vinkle-hzt closed 7 months ago

Vinkle-hzt commented 8 months ago

System Info

- Platform: nvidia/cuda:12.1.0-devel-ubuntu22.04
- Python: 3.10.3
- transformers: 4.38.2

Who can help?

@ArthurZucker and @younesbelkada

Reproduction

1. Download the model: yahma/llama-7b-hf
2. Load the model with the following code:

```python
import torch

from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"
args = {
    "device_map": {
        "model.embed_tokens": "cpu",
        "model.layers.0": "cpu",
        "model.layers.1": "cpu",
        "model.layers.2": "cpu",
        "model.layers.3": "cpu",
        "model.layers.4": "cpu",
        "model.layers.5": "cpu",
        "model.layers.6": "cpu",
        "model.layers.7": "cpu",
        "model.layers.8": "cpu",
        "model.layers.9": "cpu",
        "model.layers.10": "cpu",
        "model.layers.11": "cuda:0",
        "model.layers.12": "cuda:0",
        "model.layers.13": "cuda:0",
        "model.layers.14": "cuda:0",
        "model.layers.15": "cuda:0",
        "model.layers.16": "cuda:0",
        "model.layers.17": "cuda:0",
        "model.layers.18": "cuda:0",
        "model.layers.19": "cuda:0",
        "model.layers.20": "cuda:0",
        "model.layers.21": "cuda:0",
        "model.layers.22": "cuda:0",
        "model.layers.23": "cpu",
        "model.layers.24": "cpu",
        "model.layers.25": "cpu",
        "model.layers.26": "cpu",
        "model.layers.27": "cpu",
        "model.layers.28": "cpu",
        "model.layers.29": "cpu",
        "model.layers.30": "cpu",
        "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}

model = LlamaForCausalLM.from_pretrained(path, **args)
```


3. Observe a warning message and unexpectedly high GPU memory use (9 GB+, where about 2.4 GB was expected):

```
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
```
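To confirm the actual allocation, a quick check is possible (a minimal sketch, assuming a single CUDA device and that `model` has finished loading):

```python
import torch

# Report how much GPU memory the loaded model actually occupies.
allocated_gb = torch.cuda.memory_allocated(0) / 1e9
print(f"GPU memory allocated: {allocated_gb:.2f} GB")
```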


4. Using pdb, you can see that model.layers[11] has not been loaded in 8-bit:

```
(Pdb) p model.model.layers[11].self_attn.q_proj.weight.dtype
torch.float32
```

5. However, when I map only the top layers to the GPU, it works well:
```python
args = {
    "device_map": {
        "model.embed_tokens": "cuda:0",
        "model.layers.0": "cuda:0",
        "model.layers.1": "cuda:0",
        "model.layers.2": "cuda:0",
        "model.layers.3": "cuda:0",
        "model.layers.4": "cuda:0",
        "model.layers.5": "cuda:0",
        "model.layers.6": "cuda:0",
        "model.layers.7": "cuda:0",
        "model.layers.8": "cuda:0",
        "model.layers.9": "cuda:0",
        "model.layers.10": "cuda:0",
        "model.layers.11": "cpu",
        "model.layers.12": "cpu",
        "model.layers.13": "cpu",
        "model.layers.14": "cpu",
        "model.layers.15": "cpu",
        "model.layers.16": "cpu",
        "model.layers.17": "cpu",
        "model.layers.18": "cpu",
        "model.layers.19": "cpu",
        "model.layers.20": "cpu",
        "model.layers.21": "cpu",
        "model.layers.22": "cpu",
        "model.layers.23": "cpu",
        "model.layers.24": "cpu",
        "model.layers.25": "cpu",
        "model.layers.26": "cpu",
        "model.layers.27": "cpu",
        "model.layers.28": "cpu",
        "model.layers.29": "cpu",
        "model.layers.30": "cpu",
        "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}
```

The same pdb check now shows int8 weights:

```
(Pdb) p model.model.layers[0].self_attn.q_proj.weight.dtype
torch.int8
```
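For reference, letting accelerate derive the placement from a memory budget is one possible way to avoid hand-writing the per-layer map. This is a sketch, and the `max_memory` values below are illustrative, not measured:

```python
# Let accelerate compute the device_map from a memory budget instead of
# per-layer placements. Budget values here are illustrative assumptions.
model = LlamaForCausalLM.from_pretrained(
    path,
    device_map="auto",
    max_memory={0: "2.4GiB", "cpu": "30GiB"},
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,
    ),
)
```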

Expected behavior

I hope that when I set both device_map and quantization_config, the quantized model loads correctly, even if only the middle layers are mapped to the GPU. In particular, in my example, I want layer.11 through layer.22 to be correctly loaded in int8.
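A concrete check for that expectation (a sketch; it assumes the `model` object from step 2 above):

```python
# Every layer mapped to cuda:0 (11 through 22) should carry int8 weights.
for i in range(11, 23):
    dtype = model.model.layers[i].self_attn.q_proj.weight.dtype
    assert dtype == torch.int8, f"layer {i} was not quantized: {dtype}"
print("layers 11-22 are all int8")
```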

ArthurZucker commented 7 months ago

Pinging @SunMarc as well here!

SunMarc commented 7 months ago

Hi @Vinkle-hzt, thanks for reporting. This should be fixed by the PR above!