DLRover: An Automatic Distributed Deep Learning System

While using the Megatron distributed flash-checkpoint to recover, an error occurs in load_checkpoint #1233

Open deepcoldfish opened 3 months ago

deepcoldfish commented 3 months ago

Env: 16 GPUs + Llama2 pretrain + Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: when killing a training process to retrigger fault tolerance with the Megatron distributed flash checkpoint, the DP 1 group's load_checkpoint fails with the following log:

WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
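For context, the parallel layout above (16 GPUs, TP 8, PP 1) leaves exactly two data-parallel groups, which is why DP 0 and DP 1 are discussed separately below; a minimal check of that arithmetic:

```python
# Sanity check of the parallel layout described in the report.
world_size = 16               # 16 GPUs
tp, pp = 8, 1                 # tensor / pipeline model-parallel sizes
dp = world_size // (tp * pp)  # data-parallel size
assert dp == 2                # two data-parallel groups: DP 0 and DP 1
```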

The reason is that the DP 1 group loads the checkpoint from storage because it has no model state in memory, and that path uses an allreduce in read_metadata, while the DP 0 group only loads from memory.
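For reference, here is a simplified sketch of what an allreduce-based metadata sync like read_metadata does (a paraphrase for illustration, not the actual Megatron-LM source): every rank that enters it contributes the iteration it read and keeps the maximum, so if only part of the ranks reach this collective, the result is undefined, which matches the huge 4160813071 value in the warnings above.

```python
import torch
import torch.distributed as dist

def read_metadata_sketch(iteration_from_tracker: int) -> int:
    """Simplified sketch of an allreduce-based metadata sync
    (illustrative only, not the real Megatron-LM read_metadata)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    iters = torch.tensor([iteration_from_tracker], dtype=torch.long, device=device)
    # Every rank must reach this collective; if the DP 0 ranks skip it
    # (they restore from memory), the allreduce is mismatched and the
    # returned "max iteration" is garbage.
    dist.all_reduce(iters, op=dist.ReduceOp.MAX)
    max_iter = int(iters[0].item())
    if iteration_from_tracker != max_iter:
        print(f"WARNING: on rank {dist.get_rank()} found iteration "
              f"{iteration_from_tracker} in the metadata while max iteration "
              f"across the ranks is {max_iter}, replacing it with max iteration.")
    return max_iter
```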

BalaBalaYi commented 1 month ago

Can you provide more information? The more detailed, the better, e.g. details of the kill (which checkpoint step failed? which checkpoint step is loaded after failover?).

deepcoldfish commented 5 days ago

> Can you provide more information? The more detailed, the better, e.g. details of the kill (which checkpoint step failed? which checkpoint step is loaded after failover?).

During training, after a checkpoint has been saved to memory or storage, we kill a training process (on node 1) to retrigger a restart of the training cluster.

After the restart, all nodes will recover from memory.

When dp_rank != 0, model_state_dict is empty, so execution goes here and calls read_metadata here. Nodes with dp_rank = 0 have model_state_dict in memory and do not enter this branch.

read_metadata triggers a global sync across all ranks, but the dp_rank = 0 ranks never enter it, so the collective is mismatched and the step resolution fails.
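To make the divergence concrete, here is an illustrative sketch of the two restore paths; the helper names are placeholders, not the real DLRover/Megatron APIs:

```python
from typing import Any, Dict, Optional

def restore_after_failover(in_memory_state: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Illustrative only: the helpers below stand in for the real code paths."""
    if in_memory_state:
        # dp_rank == 0 path: the flash checkpoint is still in memory,
        # so the state is restored directly and no collective is issued.
        return in_memory_state
    # dp_rank != 0 path: memory is empty after the restart, so the checkpoint
    # is read from storage and read_metadata runs a global allreduce to agree
    # on the max iteration. Because the dp_rank == 0 ranks never reach this
    # collective, it is mismatched and the "max iteration" becomes garbage
    # (4160813071 in the log above).
    read_metadata_allreduce()   # placeholder for the collective call
    return load_from_storage()  # placeholder for the storage read

def read_metadata_allreduce() -> None:
    pass  # stands in for the allreduce shown in the earlier sketch

def load_from_storage() -> Dict[str, Any]:
    return {}  # stands in for reading the checkpoint files from storage
```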