Thanks for the amazing work on accelerating distributed training. When I launch a Megatron-LM training task with `deepspeed train.py`, I get this log.
It seems that only rank 0 saves the model weights to shared memory, so the save to disk is blocked. But when I start the training task with `dlrover-run`, the flash checkpoint saves the model weights normally.
Using Megatron-LM 0.6.0 and dlrover 0.3.6rc0
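For reference, a rough sketch of the two launch commands (node/GPU counts and script arguments are placeholders, not my exact command line):

```bash
# Launch via the DeepSpeed launcher -- with this, only rank 0 seems to write
# weights to shared memory and the save to disk blocks
# (GPU count is a placeholder):
deepspeed --num_gpus 8 train.py

# Launch via dlrover-run (torchrun-compatible arguments) -- with this, the
# flash checkpoint saves normally (node/GPU counts are placeholders):
dlrover-run --nnodes=1 --nproc_per_node=8 train.py
```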