intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Incomplete save of ckpt files #1135

Open husky23333 opened 4 months ago

husky23333 commented 4 months ago

I am using DLRover with Megatron-DeepSpeed on a machine with 4 GPUs. The hybrid parallel settings are as follows: TP: [0,1], [2,3]; DP: [0,2], [1,3]. I have also configured DeepSpeed with ZeRO stage 1. The ckpt files that were saved are as follows: (image: dlrover-deepspeed)

Normally, the ckpt files include these: (image)

layer_-model_states.pt and zero_pp_rank1optim_states.pt are missing
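
For readers checking the rank layout described above, here is a minimal sketch of how the TP and DP groups follow from the world size and the tensor-parallel size. It assumes no pipeline parallelism and that tensor-parallel ranks are consecutive, which matches the groups listed in the report. The function name `build_tp_dp_groups` is hypothetical and is not part of DLRover, Megatron, or DeepSpeed.

```python
def build_tp_dp_groups(world_size, tp_size):
    """Return (tp_groups, dp_groups) as lists of global ranks.

    Assumption: pipeline parallel size is 1 and tensor-parallel
    ranks are laid out consecutively.
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    # Consecutive ranks share one tensor-parallel group.
    tp_groups = [
        list(range(start, start + tp_size))
        for start in range(0, world_size, tp_size)
    ]
    # Ranks at the same position in each TP group form a data-parallel group.
    dp_groups = [
        list(range(offset, world_size, tp_size))
        for offset in range(tp_size)
    ]
    return tp_groups, dp_groups


if __name__ == "__main__":
    tp, dp = build_tp_dp_groups(world_size=4, tp_size=2)
    print("TP:", tp)
    print("DP:", dp)
```

With 4 GPUs and a tensor-parallel size of 2, this reproduces the TP groups [0,1], [2,3] and the DP groups [0,2], [1,3] from the report.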

wwj-2017-1117 commented 4 months ago

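Same question. When parallel strategies such as TP, PP, and DeepSpeed + ZeRO are configured, can training be recovered with elastic scaling after a network problem, a GPU failure, or a node anomaly?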

workingloong commented 4 months ago

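Same question. When parallel strategies such as TP, PP, and DeepSpeed + ZeRO are configured, can training be recovered with elastic scaling after a network problem, a GPU failure, or a node anomaly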

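DLRover's fault tolerance is based on torchelastic's approach of restarting the child processes, so in theory training can be recovered as long as a checkpoint exists. For a specific parallel strategy, whether recovery is possible mainly depends on whether the number of child processes restarted after the failure matches the number before the failure, that is, whether the global world size changes.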
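
The world-size condition above can be sketched as a small pre-resume check. This is only an illustration under stated assumptions, not DLRover's implementation: `ParallelLayout` and `can_resume_directly` are hypothetical names, and the check simply requires that the restarted job has the same global world size and the same TP/PP sizes as the job that wrote the checkpoint. The reasoning is that ZeRO partitions optimizer state across data-parallel ranks, so a different data-parallel size means the saved shards no longer map one-to-one onto the restarted ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical description of a job's parallel layout."""
    world_size: int
    tp_size: int
    pp_size: int = 1

    @property
    def dp_size(self):
        # Data-parallel size is what remains after TP and PP.
        return self.world_size // (self.tp_size * self.pp_size)


def can_resume_directly(saved, restarted):
    """True if the restarted job has the same layout as the saved one.

    Assumption: sharded checkpoints are only loaded as-is when the
    global world size and the TP/PP sizes are all unchanged.
    """
    return saved == restarted


if __name__ == "__main__":
    before = ParallelLayout(world_size=4, tp_size=2)
    same = ParallelLayout(world_size=4, tp_size=2)
    shrunk = ParallelLayout(world_size=2, tp_size=2)
    print(can_resume_directly(before, same))
    print(can_resume_directly(before, shrunk), before.dp_size, shrunk.dp_size)
```

In the 4-GPU setup from this issue, losing one DP replica would shrink the data-parallel size from 2 to 1, which is the case where the global world size changes.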

wwj-2017-1117 commented 4 months ago

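@workingloong Is there logic in the code that brings up a new pod when a GPU drops off the bus or hits an ECC error? I don't seem to see it. The manager.HandleFaultPods logic doesn't look like it either.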