huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Incorrect device placement when used with quantization_config #2905


xinghaow99 commented 2 days ago

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /remote-home/xhwang/anaconda3/envs/gloq/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.54 GB
- GPU type: NVIDIA A800-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []


Reproduction

Hi! I want to load my model on CPU initially and later wrap some submodules with DDP via accelerator.prepare(). Here is a simple reproduction:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
import torch

model = AutoModelForCausalLM.from_pretrained(
    'models/Llama-2-7b-hf-2bit-64rank-5iter',  # base model obtained by LoftQ, should be equivalent to 'LoftQ/Llama-2-7b-hf-2bit-64rank'
    torch_dtype=torch.bfloat16,
    device_map='cpu',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4'
    ),
)
sub_module = model.model.layers[10]
accelerator = Accelerator()
sub_module = accelerator.prepare(sub_module)
print(sub_module.device)  # cuda:0
dummy_inputs = torch.randn(1, 2048, 4096).to(accelerator.device)
position_ids = torch.arange(2048).unsqueeze(0).to(accelerator.device)
output = sub_module(dummy_inputs, position_ids=position_ids)
```

This fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I can see that the tensors are sent back to CPU by the device-alignment hooks that accelerate attaches to the module. Is this expected? Is there a workaround? Thanks for any help!
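
For reference, this is how I looked at the hooks (continuing the script above). Stripping them with remove_hook_from_module from accelerate.hooks is only a guess at a workaround on my side, not something I've verified:

```python
from accelerate.hooks import remove_hook_from_module

# Accelerate stores its hook on each wrapped module as `_hf_hook`.
for name, module in sub_module.named_modules():
    hook = getattr(module, "_hf_hook", None)
    if hook is not None:
        print(name, type(hook).__name__)  # e.g. AlignDevicesHook

# Unverified workaround attempt: remove the hooks so inputs/outputs are no
# longer forced back to the device recorded at load time.
remove_hook_from_module(sub_module, recurse=True)
```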

Expected behavior

Model and tensors should both be on cuda.

SunMarc commented 1 day ago

Hi @xinghaow99, it is not possible to load a bnb-quantized model on CPU for now.

If you run the following, you will get an error. I will fix the missing check:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
import torch

model = AutoModelForCausalLM.from_pretrained(
    'models/Llama-2-7b-hf-2bit-64rank-5iter',  # base model obtained by LoftQ, should be equivalent to 'LoftQ/Llama-2-7b-hf-2bit-64rank'
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4'
    ),
)
```

I recommend loading your model directly on GPU by setting device_map={"": "cuda"}.

xinghaow99 commented 1 day ago

@SunMarc Hi! Thank you for getting back to me. I'm trying to move submodules onto the GPU dynamically (only while they are computing) to save GPU RAM, since I'm training the model layer by layer. I guess this is not supported for now...
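
For context, this is roughly the pattern I'm after, sketched here without quantization (assuming `model` is the same Llama-2-7B architecture loaded on CPU in bfloat16, and the shapes mirror the repro above):

```python
import torch

# Keep the full model on CPU and bring one decoder layer onto the GPU only
# for the duration of its forward pass, then move it back to free memory.
layer = model.model.layers[10]

layer.to("cuda")
hidden = torch.randn(1, 2048, 4096, device="cuda", dtype=torch.bfloat16)
position_ids = torch.arange(2048, device="cuda").unsqueeze(0)
out = layer(hidden, position_ids=position_ids)
layer.to("cpu")
```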