huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

`RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!` when using Deepspeed ZeRO3 via accelerate #2979

Closed. echo-yi closed this issue 1 month ago

echo-yi commented 3 months ago

System Info

- `Accelerate` version: 0.33.0
- Platform: Linux-5.15.133+-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.25.2
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1842.60 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: no
    - use_cpu: True
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: False
    - tpu_use_cluster: False
    - tpu_use_sudo: False

Reproduction

`trainer.train()` throws `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!`, which is confusing because I thought ZeRO-3 partitions model parameters, so it would seem natural for tensors to live on different devices.
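(As a side note, this diagnostic helper is not part of the original script, just a minimal sketch: it enumerates which devices the loaded model's parameters report, per rank, before the model is handed to the trainer. Under ZeRO-3 each process should normally only see its own `cuda:<local_rank>`, so a mix of cuda:0 and cuda:1 inside one process points at a placement problem.)

import os

def parameter_devices(model):
    # Collect every distinct device the model's parameters currently report.
    devices = sorted({str(p.device) for p in model.parameters()})
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    print(f"[local_rank {rank}] parameter devices: {devices}")
    return devices

# e.g. call parameter_devices(model) right before trainer.train()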

pretrain.py

...
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
...
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=["q_proj", "v_proj","up_proj","o_proj","k_proj","down_proj","gate_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=max_steps,
    num_train_epochs=num_train_epochs,
    logging_steps=logging_steps,
    eval_steps=eval_steps,
    save_steps=save_steps,
    evaluation_strategy='steps',
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
    weight_decay=weight_decay,
    # optim=optim, # You are using ZeRO with an untested optimizer
    bf16=bf16,
    remove_unused_columns=remove_unused_columns,
    run_name=run_name,
    report_to=report_to,
    ddp_find_unused_parameters=False, # RuntimeError: Expected to mark a variable ready only once.
    ddp_timeout=72000, # RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()

zero3_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Command: `accelerate launch --config_file zero3_config.yaml pretrain.py --num_processes=2 --multi_gpu`

To be precise, I'm running this command with Kubeflow, so:

@dsl.container_component
def pretrain():
    return dsl.ContainerSpec(
        image=IMAGE_PATH,
        command=['accelerate', 'launch', '--config_file', 'zero3_config.yaml', 'pretrain.py', '--num_processes=2', '--multi_gpu'])

@dsl.pipeline(name=PIPELINE_NAME,
              description="pretrain",
              pipeline_root=PIPELINE_ROOT,
              )
def pipeline_func(
):
    train_task = pretrain()
    train_task.set_accelerator_type("nvidia.com/gpu")
    train_task.set_accelerator_limit(2)

versions

import bitsandbytes
import accelerate
import transformers
import trl
import peft
import deepspeed
print(f'bitsandbytes=={bitsandbytes.__version__}') # 0.43.2
print(f'accelerate=={accelerate.__version__}') # 0.33.0
print(f'transformers=={transformers.__version__}') # 4.43.2
print(f'trl=={trl.__version__}') # 0.9.6
print(f'peft=={peft.__version__}') # 0.11.1
print(f'deepspeed=={deepspeed.__version__}') # 0.14.0

error log

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Traceback (most recent call last):
  File "//pretrain.py", line 275, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 451, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3363, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 893, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

Expected behavior

train the model without the error above

ArthurinRUC commented 3 months ago

TrainingArguments should be initialized before the model: when using ZeRO-3, the from_pretrained method will then initialize and split the model up front. Refer to https://huggingface.co/docs/transformers/main/en/deepspeed#zero-configuration
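For illustration only (a minimal sketch with placeholder arguments, not the reporter's actual config), the ordering described in those docs looks like this:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# 1) Create TrainingArguments first so the DeepSpeed/ZeRO-3 context is set up
#    before any weights are loaded.
training_args = TrainingArguments(
    output_dir="out",               # placeholder value
    per_device_train_batch_size=1,  # placeholder value
    bf16=True,
)

# 2) from_pretrained can now shard the weights under ZeRO-3 instead of
#    materializing the full model on a single device.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)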

echo-yi commented 3 months ago

I also tried initializing TrainingArguments first like below, but nothing's changed.

training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=max_steps,
    num_train_epochs=num_train_epochs,
    logging_steps=logging_steps,
    eval_steps=eval_steps,
    save_steps=save_steps,
    evaluation_strategy='steps',
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
    weight_decay=weight_decay,
    # optim=optim, # You are using ZeRO with an untested optimizer
    bf16=bf16,
    remove_unused_columns=remove_unused_columns,
    run_name=run_name,
    report_to=report_to,
    ddp_find_unused_parameters=False, # RuntimeError: Expected to mark a variable ready only once.
    ddp_timeout=72000, # RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

ArthurinRUC commented 3 months ago

Well, in my case that works :) so I'm not sure whether a certain config is causing your bug.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.