huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl

PPOv2 Trainer with DeepSpeed ZeRO-3 CPU offload: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! #1891

Open supermancmk opened 1 month ago

supermancmk commented 1 month ago

@younesbelkada @lvwerra @lewtun @kashif @vwxyzjn @edbeeching @qgallouedec @Michellehbn Hi, I am using the PPOv2 trainer and running it with the command given for examples/scripts/ppo/ppo.py, except that I set offload_optimizer_device and offload_param_device to cpu in deepspeed_zero3.yaml (i.e. DeepSpeed ZeRO-3 with CPU offload), with no other changes. The following error occurs: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!

To reproduce the error:

Following is my deepspeed_zero3.yaml config (note offload_optimizer_device: cpu and offload_param_device: cpu):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
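
For reference, roughly the same ZeRO-3 + CPU-offload settings can also be expressed programmatically through accelerate's DeepSpeedPlugin. This is only a sketch of how the YAML fields map to plugin arguments, not how ppo.py actually builds its Accelerator:

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Sketch only: mirrors the deepspeed_zero3.yaml above instead of going
# through `accelerate launch --config_file ...`.
plugin = DeepSpeedPlugin(
    zero_stage=3,                    # zero_stage: 3
    offload_optimizer_device="cpu",  # offload_optimizer_device: cpu
    offload_param_device="cpu",      # offload_param_device: cpu
    zero3_init_flag=True,            # zero3_init_flag: true
    zero3_save_16bit_model=True,     # zero3_save_16bit_model: true
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=plugin)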

Following is my command:

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/ppo/ppo.py \
    --output_dir models/minimal/ppo \
    --num_ppo_epochs 1 \
    --num_mini_batches 1 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 5 \
    --gradient_accumulation_steps 1 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path EleutherAI/pythia-1b-deduped \
    --reward_model_path EleutherAI/pythia-1b-deduped \
    --local_rollout_forward_batch_size 5 \
    --non_eos_penalty

Following is the error:

Traceback (most recent call last):
  File "/examples/scripts/ppo/ppo.py", line 115, in <module>
    trainer.train()
  File "trl/trainer/ppov2_trainer.py", line 494, in train
    accelerator.backward(loss)
  File "/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/python3.10/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

https://github.com/huggingface/trl/blob/ddf4c8dc3ecf6d9ee2b24f94c62182ffd682c808/trl/trainer/ppov2_trainer.py#L472 Note: the error is raised when trl/trainer/ppov2_trainer.py reaches accelerator.backward(loss): RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
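
For what it's worth, the failure in unscale_and_clip_grads looks like the usual PyTorch device-mismatch error: with optimizer offload, the flattened fp32 gradient partition lives on the CPU while the combined loss scale is a CUDA tensor. A standalone sketch of the same error class, independent of TRL/DeepSpeed (assumes a CUDA device is available):

import torch

# Not TRL or DeepSpeed code, just an illustration of the error class:
# an in-place op between a CPU-resident gradient and a GPU-resident scale.
grad = torch.ones(4)                               # stays on CPU, as with ZeRO-3 optimizer offload
combined_scale = torch.tensor(2.0, device="cuda")  # scale tensor left on the GPU

grad.mul_(1.0 / combined_scale)  # RuntimeError: Expected all tensors to be on the same device ...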

Thanks a lot!

FlyingDutchman26 commented 1 month ago

I have met the same problem with DPO + FSDP. With ZeRO-3, it seems that it does not work; the GPU memory usage is not correct.

renmengjie7 commented 4 days ago

I get the same error in KTO when setting offload_optimizer_device and offload_param_device to cpu. Have you solved it?

lewtun commented 9 hours ago

Hello @supermancmk, thanks for raising the issue! For some reason I am not able to reproduce it using your config and your command, with --missing_eos_penalty 1.0 in place of --non_eos_penalty (a recent refactor) 🤔

For reference, I'm running on commit 92eea1f2390fcf3c1a7c4338dfa2e574ce3374c2 and have the following env:

- `transformers` version: 4.44.2
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.25.0
- Safetensors version: 0.4.4
- Accelerate version: 0.34.0
- Accelerate config:    not found
- DeepSpeed version: 0.15.1
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA H100 80GB HBM3

Could you try updating your deepspeed and trl versions and see if the problem persists?
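
In case it helps with comparing environments, here is a quick way to print the relevant package versions before and after upgrading (just a convenience snippet, not a TRL utility):

import accelerate, deepspeed, torch, transformers, trl

# Print the versions that matter for this issue.
print("trl:", trl.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("deepspeed:", deepspeed.__version__)
print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())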