zhaoyang-star opened this issue 1 month ago
The flash checkpoint in DLRover saves and loads the distributed optimizer checkpoint of Megatron-LM in parallel. That is, each rank saves and loads its own shard of the optimizer states into a `rank_xxxx` file. You can see the details at https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/blogs/megatron_flash_checkpoint.md#save-and-load-distributed-optimizer-in-parallel
@workingloong Thanks for your quick reply. I got it.
I tried benchmarking DLRover and found `save_to_memory` costs ~55 sec. Is that normal? According to the blog, the cost of `save_to_memory` should be below 1 sec. Please correct me if I misunderstand anything. Part of the logs is as follows:
192.169.125.62: saving checkpoint at iteration 800 to /mnt/home/flash_checkpoint_output_0802/outputs/checkpoint/16b-lr1e-4-tp1-pp4
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,238] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,249] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,251] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:36:35,564] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 49.314s.
192.169.125.62: [2024-08-02 13:36:36,881] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 50.645s.
192.169.125.62: [2024-08-02 13:36:37,891] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 51.654s.
192.169.125.62: [2024-08-02 13:36:38,761] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 52.525s.
192.169.125.62: [2024-08-02 13:36:40,280] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 54.031s.
192.169.125.62: [2024-08-02 13:36:42,972] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 56.722s.
192.169.125.62: [2024-08-02 13:36:55,181] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 68.931s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 107.633s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=800, global_shard_num=0)
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 107.62s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 107.634s.
192.169.125.62: (min, max) time across ranks (ms):
192.169.125.62: save-checkpoint ................................: (107635.64, 107635.84)
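For what it's worth, the per-rank durations can be pulled out of logs like the above with a small throwaway parser (a sketch; the regex assumes the exact `execute save_to_memory in X.XXXs` wording shown):

```python
import re

# A few sample lines copied from the logs above.
LOG = """\
Local rank 2 execute save_to_memory in 49.314s.
Local rank 7 execute save_to_memory in 50.645s.
Local rank 5 execute save_to_memory in 107.633s.
"""

# Extract (rank, seconds) pairs for the save_to_memory phase.
pattern = re.compile(r"Local rank (\d+) execute save_to_memory in ([\d.]+)s")
times = {int(r): float(s) for r, s in pattern.findall(LOG)}
print(min(times.values()), max(times.values()))  # → 49.314 107.633
```

The wide spread between ranks (here ~49 s vs ~108 s) is itself a hint that something other than a uniform copy to memory, e.g. contention on a shared resource, is involved.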
Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared DLRover with Megatron-LM v0.7.0?
Did you use `distributed_optimizer` and the following APIs?
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint
Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared DLRover with Megatron-LM v0.7.0?
Not yet.
Did you use `distributed_optimizer` and the following APIs?
Yes, both are used. It is weird that when training a 16B model, saving to memory costs about 50 sec. BTW, the memory-saving time is also about 50 sec when using Megatron-LM's async save. Maybe the disk bandwidth of my environment is low.
Yeah, disk performance may affect how fast the checkpoint is saved into memory, because the async checkpoint uses shared memory, which needs to create a file on the disk. I conducted some experiments and found that saving the checkpoint into memory is much faster with an SSD than with NAS.
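One way to check whether the backing storage is the bottleneck is to probe sequential write bandwidth directly. This is a rough sketch, not DLRover code: `/dev/shm` is the Linux tmpfs typically backing shared memory, and you would replace the second path with your NAS or SSD checkpoint mount.

```python
import os
import tempfile
import time

def write_bandwidth(path, size_mb=16):
    """Time a sequential fsync'd write of size_mb megabytes; return MB/s."""
    chunk = b"\x00" * (1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

# Probe the tmpfs behind shared memory and an ordinary disk directory.
for d in ("/dev/shm", tempfile.gettempdir()):
    if os.path.isdir(d):
        mbps = write_bandwidth(os.path.join(d, "bw_probe.bin"))
        print(f"{d}: {mbps:.0f} MB/s")
```

If the shared-memory path reports bandwidth close to the disk rather than to RAM speeds, that would be consistent with the slow `save_to_memory` times in the logs.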
Megatron-LM saves `model_optim_rng.pt` and `distrib_optim.pt` in a directory named `mp_rank_xx_xxx`. But in DLRover, `distrib_optim.pt` is separated out and saved in a directory named `rank_xxxx`. It works if checkpoints are both saved and loaded with DLRover, but loading will fail if a checkpoint was saved by Megatron-LM and then loaded by DLRover. So I am curious why it is designed this way? Thanks @workingloong
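For concreteness, the two layouts being compared might look roughly like this. The path shapes are illustrative assumptions built from the names mentioned above (`mp_rank_xx_xxx`, `rank_xxxx`), not the exact on-disk format of either project:

```python
import os

def megatron_layout(iteration, tp_rank, pp_rank):
    # Megatron-LM: one mp_rank_* directory holds both files.
    d = f"iter_{iteration:07d}/mp_rank_{tp_rank:02d}_{pp_rank:03d}"
    return [os.path.join(d, "model_optim_rng.pt"),
            os.path.join(d, "distrib_optim.pt")]

def dlrover_layout(iteration, tp_rank, pp_rank, global_rank):
    # DLRover (as described above): the distributed-optimizer shard is
    # split out into a per-rank rank_xxxx directory, so each rank can
    # save and load its own shard without touching the others.
    d = f"iter_{iteration:07d}"
    return [os.path.join(d, f"mp_rank_{tp_rank:02d}_{pp_rank:03d}",
                         "model_optim_rng.pt"),
            os.path.join(d, f"rank_{global_rank:04d}", "distrib_optim.pt")]

print(megatron_layout(800, 0, 1))
print(dlrover_layout(800, 0, 1, 5))
```

Under this reading, the per-rank split is what makes the parallel save/load possible, and it is also exactly why a checkpoint written in Megatron-LM's layout cannot be picked up directly by DLRover.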