intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Error encountered when using flash checkpoint #1144

Open chencjcj opened 1 month ago

chencjcj commented 1 month ago

I used flash checkpoint to run training in Megatron-LM and encountered an error when saving the checkpoint. Training stopped at this point:

[2024-05-29 06:41:59,152] [INFO] [engine.py:130:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-05-29 06:41:59,152] [INFO] [engine.py:130:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-05-29 06:41:59,153] [INFO] [engine.py:44:_local_rank0_log] Use the default process group to sync when saving checkpoint.
[2024-05-29 06:41:59,153] [INFO] [engine.py:44:_local_rank0_log] Use the default process group to sync when saving checkpoint.
[2024-05-29 06:41:59,158] [INFO] [ckpt_saver.py:434:_factory] Start the checkpoint saver factory.
[2024-05-29 06:41:59,159] [INFO] [ckpt_saver.py:434:_factory] Start the checkpoint saver factory.
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:399:init] Initialize the AsyncSaver with arguments: checkpoint_dir=./ckpt, local_shard_num=1, global_shard_num=1,
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:522:_sync_shm_to_storage] Async flash checkpoint saver starts!
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:399:init] Initialize the AsyncSaver with arguments: checkpoint_dir=./ckpt, local_shard_num=1, global_shard_num=1,
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:522:_sync_shm_to_storage] Async flash checkpoint saver starts!
[2024-05-29 06:42:01,159] [INFO] [ckpt_saver.py:526:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
[2024-05-29 06:42:01,160] [INFO] [ckpt_saver.py:526:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
saving checkpoint at iteration 20 to ./ckpt
[2024-05-29 06:42:01,161] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
[2024-05-29 06:42:01,171] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: False.
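
For reference, the last two log lines above show one caller obtaining the shared-memory lock (True) and a second caller failing to obtain it (False) about 10 ms later. The sketch below only illustrates that general pattern with Python's standard `threading.Lock`. It is not DLRover's code, `try_save` and the caller names are hypothetical, and it makes no claim about the root cause of this issue.

```python
import threading


def try_save(lock, name):
    """Hypothetical helper: try to take the lock without blocking and log the result."""
    acquired = lock.acquire(blocking=False)
    print(f"{name} acquired the lock of shared memory: {acquired}.")
    return acquired


# Stand-in for a lock guarding a shared-memory buffer (assumption for illustration).
shm_lock = threading.Lock()

first = try_save(shm_lock, "caller A")   # lock is free, so this succeeds
second = try_save(shm_lock, "caller B")  # lock is still held, so this fails

# Once the holder releases the lock, a later attempt succeeds again.
if first:
    shm_lock.release()
third = try_save(shm_lock, "caller B")
```

If a caller that received False then waits for the lock while the holder never releases it, the program appears to hang, which matches the symptom of training stopping at the save step. Whether that is what happens here would need to be confirmed against the DLRover source.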

chencjcj commented 1 month ago

Megatron-LM: main, dlrover: v0.3.7