supermancmk opened this issue 1 month ago
I have hit the same problem with DPO + FSDP. With ZeRO-3, it does not seem to work; the GPU memory usage is not correct.
I get the same error in KTO when setting `offload_optimizer_device` and `offload_param_device` to `cpu`. Have you solved it?
Hello @supermancmk, thanks for raising the issue! I am not able to reproduce it for some reason, using your config and your command with `--non_eos_penalty` replaced by `--missing_eos_penalty 1.0` (a recent refactor) 🤔
For reference, I'm running on commit 92eea1f2390fcf3c1a7c4338dfa2e574ce3374c2 and have the following env:
- `transformers` version: 4.44.2
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.25.0
- Safetensors version: 0.4.4
- Accelerate version: 0.34.0
- Accelerate config: not found
- DeepSpeed version: 0.15.1
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA H100 80GB HBM3
Could you try updating your `deepspeed` and `trl` versions and see if the problem persists?
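For example, assuming a pip-managed environment:

```shell
pip install -U deepspeed trl
```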
@younesbelkada @lvwerra @lewtun @kashif @vwxyzjn @edbeeching @qgallouedec @Michellehbn Hi, I use the PPOv2 trainer for PPO and run it according to the command given in examples/scripts/ppo/ppo.py, but with `offload_optimizer_device` and `offload_param_device` set to `cpu` in deepspeed_zero3.yaml (i.e. DeepSpeed ZeRO-3 with CPU offload) and no other changes. The following error occurs: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!`
To reproduce the above error:
The following is my deepspeed_zero3.yaml config (note: `offload_optimizer_device: cpu`, `offload_param_device: cpu`):
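(The original config block was not preserved in this thread. Since the report says only the two offload lines differ from the stock `examples/accelerate_configs/deepspeed_zero3.yaml`, a sketch of the config would look roughly like this; values such as `num_processes` depend on the machine and repo version:)

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu  # changed from `none`
  offload_param_device: cpu      # changed from `none`
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```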
The following is my command:
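(The exact command was not preserved in this thread. A sketch based on the example documented at the top of `examples/scripts/ppo/ppo.py`; the flag names and values are taken from that docstring and may differ slightly by version:)

```shell
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/ppo/ppo.py \
    --output_dir models/minimal/ppo \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --non_eos_penalty
```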
The following is the error:
https://github.com/huggingface/trl/blob/ddf4c8dc3ecf6d9ee2b24f94c62182ffd682c808/trl/trainer/ppov2_trainer.py#L472 Note: the error appears when trl/trainer/ppov2_trainer.py reaches `accelerator.backward(loss)`: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`
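(For context, this is PyTorch's generic device-mismatch error. A minimal standalone illustration of the same failure mode, not TRL code:)

```python
import torch

# Mixing a CUDA tensor with a CPU tensor in a single op raises exactly this
# error; with ZeRO-3 CPU offload it can surface in backward() when an
# offloaded (CPU) parameter or optimizer state meets a GPU activation.
a = torch.ones(2, device="cuda:0")  # lives on the GPU
b = torch.ones(2)                   # lives on the CPU
c = a + b  # RuntimeError: Expected all tensors to be on the same device ...
```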
Thanks a lot!