huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

RuntimeError: Expected all tensors to be on the same device using Deepspeed with QLoRA and DPOTrainer #2482

Closed nnethercott closed 6 months ago

nnethercott commented 7 months ago

I couldn't find any similar issues in accelerate, peft, or trl, so I'm opening one here. When using the DPOTrainer on a single GPU with QLoRA I have no issues, but when I try to run the script with accelerate + deepspeed I keep getting "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!".

main.py

import torch

from transformers import AutoTokenizer, TrainingArguments, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from trl import DPOTrainer
import bitsandbytes as bnb

model_name = "lmsys/vicuna-7b-v1.5"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype = torch.float16,
    quantization_config = quantization_config,
    )

model.enable_input_require_grads()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"

dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs")["train"]

# dataset prep
def process(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['input']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + tokenizer.eos_token

    # Format rejected answer
    rejected = example['rejected'] + tokenizer.eos_token

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }
dataset = dataset.map(process, remove_columns = dataset.column_names, batched=False)

# LoRA configuration
peft_config = LoraConfig(
    r=48,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'v_proj', 'q_proj'],
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}, #https://github.com/huggingface/trl/issues/1136
    fp16 = True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=400,
    save_strategy="steps",
    save_steps = 400,
    save_total_limit=1,
    logging_steps=1,
    output_dir="./new_model",
    warmup_ratio=0.03,
    report_to="none",
    deepspeed = "./zero2.json",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=512,
    max_length=1024,
    dataset_num_proc=4,
)

# Fine-tune model with DPO
dpo_trainer.train()

zero2.json

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}
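
For context, my understanding is that the "auto" entries get filled in from the TrainingArguments by the Transformers DeepSpeed integration when the trainer is built. A rough sketch of what I think they resolve to with the arguments above on 4 GPUs (an illustration, not output dumped from the trainer):

# Sketch of how I believe the "auto" fields resolve for this run
# (assumption, not dumped from the trainer):
resolved = {
    "fp16": {"enabled": True},            # from fp16=True in TrainingArguments
    "bf16": {"enabled": False},           # bf16 is not set
    "train_micro_batch_size_per_gpu": 1,  # per_device_train_batch_size
    "gradient_accumulation_steps": 1,     # TrainingArguments default
    "train_batch_size": 1 * 1 * 4,        # micro batch * grad accum * num_processes
    # "reduce_bucket_size" is filled in from the model config (hidden_size based)
}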

accelerate config.yaml

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./zero2.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

When I comment out deepspeed = "./zero2.json" in the TrainingArguments and run the command below, I have no issues:

CUDA_VISIBLE_DEVICES=0 python main.py

If I instead run the script above with either the accelerate CLI or the deepspeed CLI, I get the same error:

accelerate launch --config_file ./config.yaml main.py

or

deepspeed main.py

both give me the following stack trace:

Stack Trace

Traceback (most recent call last):
  File "/home/nathaniel/llava/dpo-slerp/./vicuna_dpo.py", line 107, in <module>
    dpo_trainer.train()
  File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2902, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1077, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/home/nathaniel/.local/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1018, in get_batch_loss_metrics
    ) = self.concatenated_forward(model, batch)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 981, in concatenated_forward
    all_logits = model(
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/peft/peft_model.py", line 1083, in forward
    return self.base_model(
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 966, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/nathaniel/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
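
The trace dies in F.embedding inside an accelerate hook, which suggests the embedding weight is sitting on a different GPU than the batch on some ranks. A quick per-rank sanity check that could go right before dpo_trainer.train() (a sketch using standard torch/transformers APIs, not part of the original script):

# Hypothetical per-rank check: where do the input embeddings live vs. the
# GPU this process is supposed to use?
import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0
embed_device = model.get_input_embeddings().weight.device
print(f"rank {rank}: embeddings on {embed_device}, "
      f"current device cuda:{torch.cuda.current_device()}")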

Based on the accelerate DeepSpeed integration guide and other tutorials I've seen, I expected the switch to DeepSpeed above to run without this error.
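
For what it's worth, a workaround I've seen suggested for this kind of device mismatch in multi-GPU QLoRA setups (not verified for this DeepSpeed configuration) is to pin the quantized model to each process's own GPU at load time, e.g. via accelerate's PartialState:

# Commonly suggested workaround for multi-GPU QLoRA (sketch, unverified here):
# load the 4-bit model onto this process's GPU instead of letting every rank
# place it on cuda:0.
from accelerate import PartialState

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map={"": PartialState().local_process_index},
)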

nnethercott commented 7 months ago

I should also add the environment info:

- `Accelerate` version: 0.27.2
- Platform: Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.9.2
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 188.71 GB
- GPU type: NVIDIA L4
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'deepspeed_config_file': '/home/nathaniel/llava/dpo-slerp/zero2.json', 'zero3_init_flag': True}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Other package versions:

transformers==4.38.1
peft==0.8.2
trl==0.7.11

BenjaminBossan commented 7 months ago

I don't have experience with DeepSpeed, so I can't really help you here. But I wanted to mention that we're currently adding a PEFT + DS guide to the PEFT docs; maybe you can find something useful in there.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

spomichter commented 2 months ago

Was this issue solved?