intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

When performing multi-node, multi-GPU training with Megatron-LM, if the rank is only passed in the startup script and not set in the environment variables, an exception may occur (storage_type is disk) #1208

Closed lkq51 closed 1 week ago

lkq51 commented 3 months ago

When using dlrover version 0.3.7 in the nvcr.io/nvidia/pytorch:24.01-py3 environment together with Megatron-LM, if the rank value is set only in the launch script and not in the environment variables, a multi-threading issue can occur.

Specifically, multiple nodes execute commit_checkpoint, each waiting for the number of done files to equal global_shard_num and then calling self.storage.safe_rmtree(step_done_dir). The other nodes keep reporting "The number of ready shards is 1 != 2"; the counts never match, the checkpoint cannot be saved, and the program eventually times out and exits abnormally.
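For context, the commit logic behaves roughly like the following sketch; the function and variable names are illustrative only, not the actual dlrover implementation:

```python
import os
import time


def commit_checkpoint(step_done_dir, global_shard_num, storage, timeout=600):
    """Illustrative sketch of the commit-and-wait pattern, not the dlrover code.

    Each agent writes a "done" marker for its shard; the agent that believes it
    is agent rank 0 waits until every shard has reported, then removes the done
    directory. If two nodes both resolve to rank 0 (e.g. because the rank is
    missing from the environment and defaults to 0), one of them removes
    step_done_dir early and the other's count never reaches global_shard_num.
    """
    start = time.time()
    while time.time() - start < timeout:
        ready = len(os.listdir(step_done_dir))
        if ready == global_shard_num:
            storage.safe_rmtree(step_done_dir)
            return True
        print(f"The number of ready shards is {ready} != {global_shard_num}")
        time.sleep(3)
    return False
```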

workingloong commented 3 months ago

Do you use shared storage across nodes to save the checkpoint?

lkq51 commented 3 months ago

Do you use shared storage across nodes to save the checkpoint?

Yes, I am using a two-node environment with shared storage. Essentially, the problem lies in ckpt_saver.py at line 852, where the check if self._is_agent_rank_0 rigidly obtains the rank value from the environment variables; if the variable does not exist, the rank is assigned 0.

If it's non-elastic scheduling, I think it might be better to obtain the rank value from the launch script instead.
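For example, a fallback along these lines would avoid every node silently defaulting to rank 0; the NODE_RANK/RANK names and the cli_rank argument are just illustrative assumptions, not the actual ckpt_saver.py logic:

```python
import os


def resolve_agent_rank(cli_rank=None):
    # Illustrative sketch only. Prefer the environment variable set by the
    # elastic agent; otherwise fall back to the rank passed via the launch
    # script instead of silently assuming rank 0 on every node.
    env_rank = os.getenv("NODE_RANK", os.getenv("RANK"))
    if env_rank is not None:
        return int(env_rank)
    if cli_rank is not None:
        # Non-elastic scheduling: trust the rank from the launch script.
        return int(cli_rank)
    raise RuntimeError(
        "Agent rank not set; export RANK/NODE_RANK or pass the rank "
        "from the launch script."
    )
```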

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity.

github-actions[bot] commented 1 week ago

This issue is being automatically closed due to inactivity.

TomSuen commented 6 days ago

When using dlrover version 0.3.7 in the nvcr.io/nvidia/pytorch:24.01-py3 environment together with Megatron-LM, if the rank value is set only in the launch script and not in the environment variables, a multi-threading issue can occur.

Specifically, multiple nodes execute commit_checkpoint, each waiting for the number of done files to equal global_shard_num and then calling self.storage.safe_rmtree(step_done_dir). The other nodes keep reporting "The number of ready shards is 1 != 2"; the counts never match, the checkpoint cannot be saved, and the program eventually times out and exits abnormally.

Hello, may I ask how you use dlrover together with Megatron-LM? Did you only use Flash Checkpoint when saving checkpoints, or can the launcher also be used with Megatron-LM? If so, what is the command? I currently want to use dlrover-run to launch a script with a DeepSpeed configuration, but the project homepage says dlrover-run can only replace scripts that are run with torchrun.
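For reference, my current understanding from the DLRover Flash Checkpoint examples is that it is wired into Megatron-LM roughly as below; the import path and the storage_type argument are my assumptions from the docs and should be verified, not something confirmed in this thread:

```python
# Assumed usage based on the DLRover Flash Checkpoint examples for Megatron-LM.
# save_checkpoint/load_checkpoint are meant to replace the pair from
# megatron.checkpointing; verify the exact import path against the current docs.
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    StorageType,
    load_checkpoint,
    save_checkpoint,
)


def save_step(iteration, model, optimizer, opt_param_scheduler):
    # storage_type=StorageType.DISK persists the checkpoint to (shared) storage,
    # which is the "storage type is disk" case this issue is about;
    # StorageType.MEMORY would keep the copy in host memory instead.
    save_checkpoint(
        iteration,
        model,
        optimizer,
        opt_param_scheduler,
        storage_type=StorageType.DISK,
    )
```

My assumption is also that dlrover-run keeps the usual torchrun-style arguments (--nnodes, --nproc_per_node, --node_rank), since it is described as a drop-in replacement for torchrun; whether it can also launch a DeepSpeed-configured script is exactly what I am asking here.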