
DLRover: An Automatic Distributed Deep Learning System

[Error] When using deepspeed to start a megatron training task, only rank 0 of the flash checkpoint saves the model #1199

Open liangxuZhang opened 3 months ago

liangxuZhang commented 3 months ago

Thanks for the amazing work on accelerating distributed training. When I use `deepspeed train.py` to start a Megatron-LM training task, I get the log shown in the attached screenshot. It seems that only rank 0 saves its model weights to shared memory, so the save-to-disk step is blocked. However, when I start the same training task with `dlrover-run`, the flash checkpoint saves the model weights normally (second screenshot attached).

Using Megatron-LM 0.6.0 and dlrover 0.3.6rc0

workingloong commented 3 months ago

You can check whether the other ranks have a non-empty state dict when calling save_checkpoint, for example with a sketch like the one below.
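
A minimal debugging sketch for this check, assuming a PyTorch distributed setup; the helper name `log_state_dict_size` is hypothetical and not part of the DLRover or Megatron-LM API. Call it on every rank just before save_checkpoint to see whether non-zero ranks actually have anything to persist:

```python
import torch.distributed as dist

def log_state_dict_size(state_dict, tag="flash-ckpt"):
    """Print how many top-level entries this rank is about to save."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    num_entries = len(state_dict) if state_dict else 0
    print(f"[rank {rank}] {tag}: state_dict has {num_entries} top-level entries",
          flush=True)

# Example usage right before save_checkpoint in train.py
# (the variable names below are placeholders):
# log_state_dict_size(model_state_dict, tag="model")
# log_state_dict_size(optimizer_state_dict, tag="optimizer")
```

If the non-zero ranks report empty state dicts only under the `deepspeed` launcher, the problem is likely in how the launcher sets up ranks for the checkpoint path rather than in the flash checkpoint itself.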

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had recent activity.