Open husky23333 opened 4 months ago
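Same question: when parallel strategies such as TP, PP, or DeepSpeed + ZeRO are configured and a network problem, GPU failure, or node fault occurs, can training be recovered in combination with elastic scaling?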
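DLRover's fault tolerance is built on torchelastic's approach of restarting the worker subprocesses, so in theory training can be recovered as long as a checkpoint exists. For a specific parallel strategy, whether recovery works depends mainly on whether the number of subprocesses restarted after the failure matches the number before the failure, i.e. whether the global world size changes.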
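To make the world-size condition concrete, here is a minimal sketch. It assumes the usual Megatron-style relation world size = TP × PP × DP; the function names are hypothetical and are not part of DLRover or torchelastic. It only classifies a restart; whether a changed DP degree can actually be resumed depends on whether the checkpoint format can be resharded, which this thread does not settle.

```python
from typing import Optional


def data_parallel_size(world_size: int, tp: int, pp: int) -> Optional[int]:
    """Return the DP degree, or None if world_size cannot host TP x PP groups."""
    model_parallel = tp * pp
    if world_size < model_parallel or world_size % model_parallel != 0:
        return None
    return world_size // model_parallel


def classify_restart(old_world_size: int, new_world_size: int, tp: int, pp: int) -> str:
    """Compare the process layout before and after a restart (hypothetical helper)."""
    old_dp = data_parallel_size(old_world_size, tp, pp)
    new_dp = data_parallel_size(new_world_size, tp, pp)
    if new_dp is None:
        # The surviving processes cannot form complete TP/PP groups.
        return "cannot form TP/PP groups"
    if new_dp == old_dp:
        # Same global world size: every rank can reload its own shard.
        return "same layout"
    # DP degree changed: ZeRO partitions optimizer state across DP ranks,
    # so the saved shards no longer map one-to-one onto the new ranks.
    return "DP degree changed"


if __name__ == "__main__":
    for new_ws in (4, 2, 3):
        print(new_ws, classify_restart(4, new_ws, tp=2, pp=1))
```

With 4 processes and TP=2, PP=1, restarting with 4 processes keeps the layout, restarting with 2 changes the DP degree, and 3 cannot be split into TP groups at all.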
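@workingloong Is there logic in the code that launches a new pod when a GPU drops off the bus or an ECC error occurs? I don't seem to see it. manager.HandleFaultPods doesn't look like that logic either.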
I am using DLRover with Megatron-DeepSpeed, and my machine has 4 GPUs. The hybrid parallel settings are as follows: TP: [0,1], [2,3]; DP: [0,2], [1,3]. I also configured DeepSpeed with ZeRO stage 1. The ckpt files that were saved are as follows:
Normally, the ckpt files include these:
layer_-model_states.pt and zero_pp_rank1optim_states.pt are missing.
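For reference, the TP and DP groups quoted above can be reproduced with a short sketch. It assumes PP = 1 and the common Megatron-style layout in which tensor-parallel ranks are contiguous; `parallel_groups` is a hypothetical helper, not a Megatron-DeepSpeed or DLRover function.

```python
from typing import List, Tuple


def parallel_groups(world_size: int, tp: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (tp_groups, dp_groups), assuming PP = 1 and contiguous TP ranks."""
    if world_size % tp != 0:
        raise ValueError("world_size must be divisible by tp")
    dp = world_size // tp
    # Each TP group is a block of `tp` consecutive ranks.
    tp_groups = [list(range(i * tp, (i + 1) * tp)) for i in range(dp)]
    # Each DP group collects the ranks that share a position inside their TP group.
    dp_groups = [list(range(j, world_size, tp)) for j in range(tp)]
    return tp_groups, dp_groups


if __name__ == "__main__":
    tp_groups, dp_groups = parallel_groups(world_size=4, tp=2)
    print("TP:", tp_groups)
    print("DP:", dp_groups)
```

Under these assumptions each DP group holds two ranks, so ZeRO stage 1 would split the optimizer state of every TP rank into two partitions, one per DP rank.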