huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Model loading on meta device #27183

Closed: RonanKMcGovern closed this issue 11 months ago

RonanKMcGovern commented 11 months ago

System Info

An A6000 GPU on RunPod.


Who can help?

@ArthurZucker @younesbelkada

Reproduction

!pip install -U -q git+https://github.com/huggingface/transformers.git

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q -U scipy
!pip install -U flash-attn -q
!pip install -q -U trl
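
As a quick sanity check (handy when filing issues like this), the resolved versions of the relevant packages can be printed; a minimal sketch:

# Print the installed versions of the packages used below (illustrative only).
from importlib.metadata import version, PackageNotFoundError

for pkg in ["transformers", "accelerate", "bitsandbytes", "peft", "flash-attn", "trl"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")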

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig
import torch

model_id  = "tiiuae/falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

config = AutoConfig.from_pretrained(model_id)
config.max_position_embeddings = 4096 # (input + output) tokens can now be up to 4096

cache_dir = "./model_cache"  # placeholder so the snippet runs; any local directory works

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    quantization_config=bnb_config,
    # rope_scaling={"type": "linear", "factor": 2.0},
    device_map='auto',
    # trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True, # works with Llama models and reduces memory reqs
    cache_dir=cache_dir)
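
For reference, the placement that accelerate actually chose can be inspected after loading; hf_device_map is only populated when device_map is passed, so this is a quick way to see whether anything was sent to cpu or disk:

# Show the module-to-device assignment accelerate built for device_map='auto'.
# Entries mapped to "cpu" or "disk" correspond to offloaded weights.
print(model.hf_device_map)

# Rough check of how much memory the loaded weights occupy on GPU 0.
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated on GPU 0")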

Expected behavior

I would expect this model to easily fit on an A6000 with 48GB of VRAM.

Instead, I get this error/notification:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the .
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu/disk.
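
For scale, a rough back-of-the-envelope estimate (assuming ~7B parameters stored as 4-bit NF4 weights, plus a small allowance for quantization constants and layers kept in higher precision) puts the weights at only a few GiB:

# Approximate memory footprint of falcon-7b quantized to 4 bits (rough numbers).
n_params = 7e9
bytes_per_param = 0.5   # 4 bits per weight
overhead = 0.1          # rough allowance for quant constants, norms, embeddings
est_gib = n_params * bytes_per_param * (1 + overhead) / 1024**3
print(f"~{est_gib:.1f} GiB of quantized weights expected")  # roughly 3.5-4 GiB

so there should be no reason for anything to be offloaded to CPU or disk on a 48 GB card.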
younesbelkada commented 11 months ago

Hi @RonanKMcGovern, thanks for your issue. I ran:

import torch
from transformers import AutoModelForCausalLM

model_id  = "tiiuae/falcon-7b"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
)

for n, p in model.named_parameters():
    if p.device.type == "meta":
        print(f"{n} is on meta!")

and I can confirm that no parameters were on the meta device, while still getting the same warning message you shared. Perhaps it is a bug in accelerate. Can you file an issue there and reuse this small handy snippet?
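
The same check can be extended to buffers, which accelerate dispatches as well, in case those are what the warning is actually referring to; a small variation on the snippet above:

# Buffers (e.g. rotary embedding caches) can also end up on the meta device.
for n, b in model.named_buffers():
    if b.device.type == "meta":
        print(f"buffer {n} is on meta!")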

RonanKMcGovern commented 11 months ago

done, thanks: https://github.com/huggingface/accelerate/issues/2103