[Open] deepcoldfish opened this issue 3 months ago
Can you provide more information? The more detailed, the better, e.g. details of the kill (did it fail at the checkpoint-save step, or at the checkpoint-load step after failover?)
During training, after a checkpoint has been saved to memory or storage, we kill a training process (on node 1) to trigger a restart of the training cluster.
After the restart, all nodes are expected to recover from memory.
On ranks with dp_rank != 0, model_state_dict is empty, so execution enters this branch here and calls read_metadata here. Nodes with dp_rank = 0 still have model_state_dict in memory and do not enter this branch.
read_metadata triggers a global synchronization across all ranks, so the mismatch causes the load step to fail.
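To make the mismatch concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual flash-checkpoint code) of what happens when only the ranks without in-memory state call into a collective:

```python
# Sketch of the mismatch described above: only the ranks whose in-memory state
# is empty take the storage path, that path performs a collective, so the two
# groups of ranks diverge and the collective times out.
import datetime

import torch
import torch.distributed as dist


def read_metadata_from_storage():
    # Stand-in for read_metadata: it performs a collective across the whole
    # process group, so it only completes if every rank calls it.
    t = torch.zeros(1)
    dist.all_reduce(t)
    return t


def load_checkpoint(model_state_dict):
    if not model_state_dict:
        # dp_rank != 0 after restart: nothing in memory, fall back to storage.
        # This rank now blocks inside all_reduce waiting for the others.
        return read_metadata_from_storage()
    # dp_rank == 0: state is still in memory, so the storage path (and its
    # collective) is skipped entirely -> the other ranks hang or time out.
    return model_state_dict


if __name__ == "__main__":
    # Run with: torchrun --nproc_per_node=2 this_file.py
    dist.init_process_group("gloo", timeout=datetime.timedelta(seconds=10))
    rank = dist.get_rank()
    # Pretend only rank 0 still has the model in memory after the restart.
    state = {"weights": torch.ones(1)} if rank == 0 else {}
    load_checkpoint(state)
    print(f"rank {rank} finished")
```

With two processes, rank 0 returns immediately from the memory path while rank 1 waits in the collective until the timeout fires, which mirrors the failure described above.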
Env: 16 GPUs, Llama2 pretrain, Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: when killing a training process to retrigger fault tolerance with Megatron distributed flash-checkpoint, the DP 1 group's load_checkpoint failed with the following log,
The reason is that the DP 1 group loads the checkpoint from storage (since it has no model state in memory) and calls allreduce inside read_metadata, while the DP 0 group loads only from memory and never reaches that collective.
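One way to avoid this kind of mismatch (a sketch only, using hypothetical helper names rather than the project's actual fix) is to make the load-path decision collectively, so that either every rank calls read_metadata or none does:

```python
# Sketch of a collective load-path decision (hypothetical helpers).
import torch
import torch.distributed as dist


def read_metadata_from_storage():
    # Same stand-in collective as in the sketch above.
    t = torch.zeros(1)
    dist.all_reduce(t)
    return t


def everyone_has_memory_checkpoint(model_state_dict) -> bool:
    # 1 if this rank still holds the checkpoint in memory, else 0.
    flag = torch.tensor([1 if model_state_dict else 0])
    # MIN across all ranks: True only if every rank can recover from memory.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())


def load_checkpoint(model_state_dict):
    if everyone_has_memory_checkpoint(model_state_dict):
        # All ranks stay on the memory path together.
        return model_state_dict
    # Otherwise all ranks take the storage path, so the collective inside
    # read_metadata is matched on every rank.
    return read_metadata_from_storage()
```

The key point is that the branch condition is derived from an all_reduce rather than from local state, so every rank in the process group takes the same path and the collectives stay matched.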