Closed: lkq51 closed this issue 1 week ago
Do you use the shared storage by nodes to save the checkpoint?
Yes, I am using a dual-machine environment with shared storage. Essentially, the problem lies in ckpt_saver.py at line 852, where the check `if self._is_agent_rank_0` rigidly obtains the rank value from the environment variables; if the variable is not set, the rank defaults to 0.
For non-elastic scheduling, I think it might be better to obtain the rank value from the launch script instead.
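To illustrate what I mean, here is only a rough sketch of the pattern, not the actual dlrover code; the environment variable name `RANK` and the `--node_rank` argument are just examples:

```python
import argparse
import os

# Rough sketch of the pattern described above (illustrative names only).
# The agent decides whether it is "rank 0" purely from an environment
# variable and falls back to 0 when the variable is missing, so on a
# multi-node launch where the rank is passed only as a script argument,
# every node believes it is rank 0.
def is_agent_rank_0_env_only() -> bool:
    return int(os.getenv("RANK", "0")) == 0

# A possible alternative for non-elastic scheduling: let the launch script
# pass the node rank explicitly and fall back to the environment variable
# only when the argument is absent.
def is_agent_rank_0(args: argparse.Namespace) -> bool:
    rank = args.node_rank if args.node_rank is not None else int(os.getenv("RANK", "0"))
    return rank == 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--node_rank", type=int, default=None)
    print(is_agent_rank_0(parser.parse_args()))
```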
This issue has been automatically marked as stale because it has not had recent activity.
This issue is being automatically closed due to inactivity.
When using dlrover 0.3.7 in the nvcr.io/nvidia/pytorch:24.01-py3 environment together with Megatron-LM, if the rank value is set only in the launch script and not in the environment variables, a race condition can occur when committing the checkpoint.
Specifically, multiple nodes will execute `commit_checkpoint`, each waiting for the number of done files to equal `global_shard_num` and then calling `self.storage.safe_rmtree(step_done_dir)`. The other nodes then report the error "The number of ready shards is 1 != 2"; the counts never match, the checkpoint is never saved properly, and the program times out and exits abnormally.
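Roughly, the failure looks like the following sketch (not the real dlrover implementation; the function body, polling interval, and timeout are only illustrative):

```python
import os
import shutil
import time

# Minimal sketch of the commit flow described above. Each node that
# believes it is "agent rank 0" waits for one done-marker file per shard
# and then removes the marker directory. If two nodes both think they are
# rank 0, the first one to see all markers deletes step_done_dir while the
# other is still counting, so the second node never observes
# global_shard_num markers and times out.
def commit_checkpoint(step_done_dir: str, global_shard_num: int,
                      timeout: float = 600.0, poll: float = 2.0) -> None:
    start = time.time()
    while True:
        done = len(os.listdir(step_done_dir)) if os.path.isdir(step_done_dir) else 0
        if done == global_shard_num:
            # The first "rank 0" node to get here removes the markers ...
            shutil.rmtree(step_done_dir, ignore_errors=True)
            return
        if time.time() - start > timeout:
            # ... and any other node stuck in this loop reports something like
            # "The number of ready shards is 1 != 2" and exits abnormally.
            raise TimeoutError(
                f"The number of ready shards is {done} != {global_shard_num}")
        time.sleep(poll)
```

Because both nodes think they are agent rank 0, the node that deletes the done directory first freezes the other node's count below `global_shard_num`, so it can never finish.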
Hello, may I ask how you use dlrover together with Megatron-LM? Do you only use flash checkpoint when saving the ckpt, or can the launch script also be used with Megatron-LM? If so, what is the command? I currently want to use dlrover-run to launch a script that has a deepspeed configuration, but the homepage says dlrover-run can only replace scripts run with torchrun.