zhaoyang-star opened this issue 1 month ago
The flash checkpoint in DLRover saves and loads the distributed optimizer checkpoint of Megatron-LM in parallel. That is, each rank saves and loads its own shard of the optimizer states into a `rank_xxxx` file. You can see the details at https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/blogs/megatron_flash_checkpoint.md#save-and-load-distributed-optimizer-in-parallel
@workingloong Thanks for your quick reply. I got it.
I tried benchmarking DLRover and found `save_to_memory` costs ~55 sec. Is that normal? According to the blog, the cost of `save_to_memory` should be below 1 sec. Please correct me if I misunderstand anything. Part of the logs is as follows:
192.169.125.62: saving checkpoint at iteration 800 to /mnt/home/flash_checkpoint_output_0802/outputs/checkpoint/16b-lr1e-4-tp1-pp4
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,238] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,249] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,251] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:36:35,564] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 49.314s.
192.169.125.62: [2024-08-02 13:36:36,881] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 50.645s.
192.169.125.62: [2024-08-02 13:36:37,891] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 51.654s.
192.169.125.62: [2024-08-02 13:36:38,761] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 52.525s.
192.169.125.62: [2024-08-02 13:36:40,280] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 54.031s.
192.169.125.62: [2024-08-02 13:36:42,972] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 56.722s.
192.169.125.62: [2024-08-02 13:36:55,181] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 68.931s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 107.633s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=800, global_shard_num=0)
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 107.62s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 107.634s.
192.169.125.62: (min, max) time across ranks (ms):
192.169.125.62: save-checkpoint ................................: (107635.64, 107635.84)
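For what it's worth, the per-rank durations can be pulled out of logs like the above with a small throwaway parser (a sketch; the regex assumes the exact `execute save_to_memory in X.XXXs` wording shown):

```python
import re

# A few sample lines copied from the logs above.
LOG = """\
Local rank 2 execute save_to_memory in 49.314s.
Local rank 7 execute save_to_memory in 50.645s.
Local rank 5 execute save_to_memory in 107.633s.
"""

# Extract (rank, seconds) pairs for the save_to_memory phase.
pattern = re.compile(r"Local rank (\d+) execute save_to_memory in ([\d.]+)s")
times = {int(r): float(s) for r, s in pattern.findall(LOG)}
print(min(times.values()), max(times.values()))  # → 49.314 107.633
```

The wide spread between ranks (here ~49 s vs ~108 s) is itself a hint that something other than a uniform copy to memory, e.g. contention on a shared resource, is involved.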
Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared DLRover with Megatron-LM v0.7.0?
Did you use `distributed_optimizer` and the following APIs?
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint
Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared DLRover with Megatron-LM v0.7.0?
Not yet.
Did you use `distributed_optimizer` and the following APIs?
Yes, both are used. It is weird that when training a 16B model, saving to memory costs about 50 sec. BTW, the memory-saving time is also about 50 sec when using Megatron-LM's async save. Maybe the disk bandwidth of my environment is low.
Yeah, disk performance may affect how fast the checkpoint is saved into memory, because the async checkpoint uses shared memory, which needs to create a file on the disk. I conducted some experiments and found that saving the checkpoint into memory is much faster with an SSD than with NAS.
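One way to check whether the backing storage is the bottleneck is to probe sequential write bandwidth directly. This is a rough sketch, not DLRover code: `/dev/shm` is the Linux tmpfs typically backing shared memory, and you would replace the second path with your NAS or SSD checkpoint mount.

```python
import os
import tempfile
import time

def write_bandwidth(path, size_mb=16):
    """Time a sequential fsync'd write of size_mb megabytes; return MB/s."""
    chunk = b"\x00" * (1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

# Probe the tmpfs behind shared memory and an ordinary disk directory.
for d in ("/dev/shm", tempfile.gettempdir()):
    if os.path.isdir(d):
        mbps = write_bandwidth(os.path.join(d, "bw_probe.bin"))
        print(f"{d}: {mbps:.0f} MB/s")
```

If the shared-memory path reports bandwidth close to the disk rather than to RAM speeds, that would be consistent with the slow `save_to_memory` times in the logs.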
Megatron-LM saves `model_optim_rng.pt` and `distrib_optim.pt` in a directory named `mp_rank_xx_xxx`. But in DLRover, `distrib_optim.pt` is separated out and saved in a directory named `rank_xxxx`. It works if checkpoints are both saved and loaded with DLRover, but loading will fail if a checkpoint was saved by Megatron-LM and then loaded by DLRover. So I am curious why it is designed this way? Thanks @workingloong
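For concreteness, the two layouts being compared might look roughly like this. The path shapes are illustrative assumptions built from the names mentioned above (`mp_rank_xx_xxx`, `rank_xxxx`), not the exact on-disk format of either project:

```python
import os

def megatron_layout(iteration, tp_rank, pp_rank):
    # Megatron-LM: one mp_rank_* directory holds both files.
    d = f"iter_{iteration:07d}/mp_rank_{tp_rank:02d}_{pp_rank:03d}"
    return [os.path.join(d, "model_optim_rng.pt"),
            os.path.join(d, "distrib_optim.pt")]

def dlrover_layout(iteration, tp_rank, pp_rank, global_rank):
    # DLRover (as described above): the distributed-optimizer shard is
    # split out into a per-rank rank_xxxx directory, so each rank can
    # save and load its own shard without touching the others.
    d = f"iter_{iteration:07d}"
    return [os.path.join(d, f"mp_rank_{tp_rank:02d}_{pp_rank:03d}",
                         "model_optim_rng.pt"),
            os.path.join(d, f"rank_{global_rank:04d}", "distrib_optim.pt")]

print(megatron_layout(800, 0, 1))
print(dlrover_layout(800, 0, 1, 5))
```

Under this reading, the per-rank split is what makes the parallel save/load possible, and it is also exactly why a checkpoint written in Megatron-LM's layout cannot be picked up directly by DLRover.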