hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: bug in training rm with ddp strategy with single machine multi-GPUs! #3421

Open xHansonx opened 1 year ago

xHansonx commented 1 year ago

🐛 Describe the bug

Code:

torchrun --standalone --nproc_per_node=1 train_reward_model.py --dataset Dahoas/rm-static --subset ../../../datasets/Dahoas_rm-static --max_len 512 --model gpt2 --pretrain ../../../gpt2/gpt2-small --lora_rank 0 --max_epochs 1 --batch_size 1 --loss_fn log_sig --test True --need_optim_ckpt True --strategy ddp --save_path rm_ckpt.pt

Error:

[screenshots of the error output attached]

Environment

No response

JThh commented 1 year ago

Could you share your environment settings, such as your machine type and your torch and Python versions?

xHansonx commented 1 year ago

> Could you share your environment settings, such as your machine type and your torch and Python versions?

------------ Environment ------------
Colossal-AI version: 0.2.8
PyTorch version: 1.12.1
System CUDA version: 11.3
CUDA version required by PyTorch: 11.3

JThh commented 1 year ago

Sorry for getting to your questions late. May I know why you are setting nproc_per_node=1 when you have multiple GPUs on the machine and set the strategy to ddp?
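For context, with the ddp strategy on a single machine, torchrun's --nproc_per_node is normally set to the number of GPUs to use, so that one process is launched per GPU. A sketch of the adjusted launch, assuming two GPUs are available (the GPU count here is hypothetical; only --nproc_per_node differs from the command reported above, all other arguments are kept as-is):

# launch one DDP process per GPU (here assuming 2 GPUs on one machine)
torchrun --standalone --nproc_per_node=2 train_reward_model.py --dataset Dahoas/rm-static --subset ../../../datasets/Dahoas_rm-static --max_len 512 --model gpt2 --pretrain ../../../gpt2/gpt2-small --lora_rank 0 --max_epochs 1 --batch_size 1 --loss_fn log_sig --test True --need_optim_ckpt True --strategy ddp --save_path rm_ckpt.pt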

CWHer commented 1 year ago

Thanks for reporting. This is now covered by #4023.