huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Incorrect device placement when used with quantization_config #2905


xinghaow99 commented 2 days ago

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /remote-home/xhwang/anaconda3/envs/gloq/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.54 GB
- GPU type: NVIDIA A800-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []


Reproduction

Hi! I want to load my model on CPU initially and later wrap some submodules with DDP via accelerator.prepare(). Here is a simple reproduction:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
import torch

model = AutoModelForCausalLM.from_pretrained(
    'models/Llama-2-7b-hf-2bit-64rank-5iter',  # base model obtained by LoftQ, should be equivalent to 'LoftQ/Llama-2-7b-hf-2bit-64rank'
    torch_dtype=torch.bfloat16,
    device_map='cpu',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4'
    ),
)
sub_module = model.model.layers[10]
accelerator = Accelerator()
sub_module = accelerator.prepare(sub_module)
print(sub_module.device)  # cuda:0
dummy_inputs = torch.randn(1, 2048, 4096).to(accelerator.device)
position_ids = torch.arange(2048).unsqueeze(0).to(accelerator.device)
output = sub_module(dummy_inputs, position_ids=position_ids)
```

This fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I can see that the tensors are sent back to CPU by the device-alignment hooks that accelerate attaches to the module. Is this expected? Is there a workaround? Thanks for any help!
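
For reference, this is how I looked at the hooks (continuing the script above). Stripping them with remove_hook_from_module from accelerate.hooks is only a guess at a workaround on my side, not something I've verified:

```python
from accelerate.hooks import remove_hook_from_module

# Accelerate stores its hook on each wrapped module as `_hf_hook`.
for name, module in sub_module.named_modules():
    hook = getattr(module, "_hf_hook", None)
    if hook is not None:
        print(name, type(hook).__name__)  # e.g. AlignDevicesHook

# Unverified workaround attempt: remove the hooks so inputs/outputs are no
# longer forced back to the device recorded at load time.
remove_hook_from_module(sub_module, recurse=True)
```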

Expected behavior

Model and tensors should both be on cuda.

SunMarc commented 1 day ago

Hi @xinghaow99, it is not possible to load a bnb-quantized model on CPU for now.

If you run the following, you will get an error. I will fix the missing check:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
import torch

model = AutoModelForCausalLM.from_pretrained(
    'models/Llama-2-7b-hf-2bit-64rank-5iter',  # base model obtained by LoftQ, should be equivalent to 'LoftQ/Llama-2-7b-hf-2bit-64rank'
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4'
    ),
)
```

I recommend loading your model directly on GPU by setting device_map={"": "cuda"}.

xinghaow99 commented 1 day ago

@SunMarc Hi! Thank you for getting back to me. I'm trying to move submodules onto the GPU dynamically (only while they are computing) to save GPU RAM, since I'm training the model layer by layer. I guess this is not supported for now...
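
For context, this is roughly the pattern I'm after, sketched here without quantization (assuming `model` is the same Llama-2-7B architecture loaded on CPU in bfloat16, and the shapes mirror the repro above):

```python
import torch

# Keep the full model on CPU and bring one decoder layer onto the GPU only
# for the duration of its forward pass, then move it back to free memory.
layer = model.model.layers[10]

layer.to("cuda")
hidden = torch.randn(1, 2048, 4096, device="cuda", dtype=torch.bfloat16)
position_ids = torch.arange(2048, device="cuda").unsqueeze(0)
out = layer(hidden, position_ids=position_ids)
layer.to("cpu")
```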