model = LlamaForCausalLM.from_pretrained(path, **args)
3. A warning message is printed and GPU memory use is unexpectedly high (9 GB+, while about 2.4 GB is expected):
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
4. Using pdb, you can see that model.layers.11 has not been loaded in 8-bit, while model.layers.0 has (see the fuller per-layer check sketched after this output):
(Pdb) p model.model.layers[11].self_attn.q_proj.weight.dtype
torch.float32
(Pdb) p model.model.layers[0].self_attn.q_proj.weight.dtype
torch.int8
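A fuller version of the same check, looping over every decoder layer, makes the split obvious. This is a minimal sketch assuming the model object loaded above; bitsandbytes normally replaces quantized linear layers with its Linear8bitLt module, so the module class is as telling as the weight dtype:

import torch

for i, layer in enumerate(model.model.layers):
    q_proj = layer.self_attn.q_proj
    # expected: Linear8bitLt / torch.int8 for the cuda:0 layers,
    # plain Linear / torch.float32 for the offloaded CPU layers
    print(i, type(q_proj).__name__, q_proj.weight.dtype, q_proj.weight.device)

# rough view of how much memory actually landed on the GPU (bytes -> GiB)
print(f"{torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB allocated on cuda:0")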
Expected behavior
I hope that when I set both device_map and quantization_config, the quantized model is loaded correctly, even if only the middle layers are mapped to the GPU. In particular, in my example I want model.layers.11 through model.layers.22 to be loaded in int8.
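As a concrete form of that expectation, here is a hypothetical check (not code from the report): after loading, every layer that the device_map above places on cuda:0 should hold int8 weights.

import torch

for i in range(11, 23):  # layers 11-22 are mapped to cuda:0
    w = model.model.layers[i].self_attn.q_proj.weight
    assert w.dtype == torch.int8, f"layer {i} was not quantized: {w.dtype}"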
System Info
platform: nvidia/cuda:12.1.0-devel-ubuntu22.04
python: 3.10.3
transformers: 4.38.2
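A small sketch for collecting the same details programmatically (the transformers-cli env command prints a fuller report; bitsandbytes is included here because the 8-bit path depends on it):

import torch
import transformers
import bitsandbytes

print("transformers", transformers.__version__)
print("torch", torch.__version__, "CUDA", torch.version.cuda)
print("bitsandbytes", bitsandbytes.__version__)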
Who can help?
@ArthurZucker and @younesbelkada
Reproduction
import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"
args = {
    # layers 11-22 on the GPU, everything else offloaded to the CPU
    "device_map": {
        "model.embed_tokens": "cpu",
        "model.layers.0": "cpu", "model.layers.1": "cpu", "model.layers.2": "cpu",
        "model.layers.3": "cpu", "model.layers.4": "cpu", "model.layers.5": "cpu",
        "model.layers.6": "cpu", "model.layers.7": "cpu", "model.layers.8": "cpu",
        "model.layers.9": "cpu", "model.layers.10": "cpu",
        "model.layers.11": "cuda:0", "model.layers.12": "cuda:0", "model.layers.13": "cuda:0",
        "model.layers.14": "cuda:0", "model.layers.15": "cuda:0", "model.layers.16": "cuda:0",
        "model.layers.17": "cuda:0", "model.layers.18": "cuda:0", "model.layers.19": "cuda:0",
        "model.layers.20": "cuda:0", "model.layers.21": "cuda:0", "model.layers.22": "cuda:0",
        "model.layers.23": "cpu", "model.layers.24": "cpu", "model.layers.25": "cpu",
        "model.layers.26": "cpu", "model.layers.27": "cpu", "model.layers.28": "cpu",
        "model.layers.29": "cpu", "model.layers.30": "cpu", "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}
model = LlamaForCausalLM.from_pretrained(path, **args)
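A hedged follow-up check (not part of the original report) that inspects how the model was actually dispatched and quantized once from_pretrained returns, using the hf_device_map attribute and get_memory_footprint method that transformers attaches to models loaded with a device_map:

print(model.hf_device_map)                        # per-module placement actually used
print(model.get_memory_footprint() / 1024**3)     # total model footprint in GiB
print(type(model.model.layers[11].self_attn.q_proj).__name__)  # should be Linear8bitLt, not Linear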