Lzhang-hub opened 1 month ago
I have successfully tested flash checkpoint with the following command on 4 A100 nodes, using the forked repo at commit cb995d5.
dlrover-run --max-restarts=2 --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE pretrain_gpt.py \
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE \
--use-distributed-optimizer \
--num-layers 48 \
--hidden-size 1600 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 4 \
--global-batch-size 8 \
--train-iters 100 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--split 900,50,50 \
--distributed-backend nccl \
--lr 0.00015 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--log-interval 1 \
--save-interval 100 \
--eval-interval 1000 \
--eval-iters 10
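For reference, wiring flash checkpoint into pretrain_gpt.py only requires swapping Megatron's checkpoint functions for DLRover's wrappers. A minimal sketch, assuming dlrover[torch] 0.3.x exposes the dlrover.trainer.torch.flash_checkpoint.megatron module (the import path is an assumption; verify it in your installed version):

# Sketch: replace Megatron's checkpoint I/O with DLRover's flash-checkpoint wrappers.
# The import path below is assumed from dlrover[torch] 0.3.x; check your install.
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    save_checkpoint,  # drop-in for megatron.checkpointing.save_checkpoint
    load_checkpoint,  # drop-in for megatron.checkpointing.load_checkpoint
)

# The training loop stays unchanged; the wrappers snapshot the model and
# optimizer shards to shared memory and persist them asynchronously, e.g.:
#   save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
#   iteration = load_checkpoint(model, optimizer, opt_param_scheduler)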
Which version of dlrover? Besides, have you tested with torchrun?
I tested dlrover 0.3.7 with torchrun on this repo; the error still occurs.
dlrover[torch]==0.3.7. I have reproduced the issue when I do not use --use-distributed-optimizer.
I added --use-distributed-optimizer and got a new error.
Env: 4*8=32 A100 GPUs, TP=2, PP=8
[2024-06-03 08:11:01,842] [INFO] [ckpt_saver.py:892:commit_checkpoint] The number of ready shards is 26 != 32.
Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 429, in _saver
    cls._saver_instance._sync_shm_to_storage()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 535, in _sync_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 854, in save_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 881, in commit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/storage.py", line 179, in listdir
    return os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/work/wqssd-cfs/data/LLM/multi-node-train/megatron-lm-train/checkpoint/qy/multi-node/gpt3-1.5B-flashckpt/._dlrover_ckpt_stage/20.done'
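The traceback points at the commit step of the async saver: after the shards are flushed from shared memory to storage, the agent counts the done markers under the step's staging directory before publishing the checkpoint. A rough sketch of that kind of check (illustrative only, not DLRover's actual code; the "<step>.done" layout is inferred from the path in the error):

# Illustrative commit check (not DLRover's code): wait until every shard has
# reported done in the step's staging dir, then allow the commit.
import os
import time

def commit_checkpoint(step_done_dir: str, expected_shards: int, timeout_s: float = 600.0) -> bool:
    start = time.time()
    while time.time() - start < timeout_s:
        # On a shared filesystem (CFS/NFS) this can raise FileNotFoundError if the
        # staging dir created by another node is not yet visible on this one.
        done_files = os.listdir(step_done_dir)
        if len(done_files) >= expected_shards:
            return True   # all 32 shards reported done; safe to commit
        time.sleep(2)     # e.g. "ready shards is 26 != 32" while still waiting
    return False          # incomplete checkpoint; do not publish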
This error is related to the storage I use and can be ignored.
For Megatron-LM training with flash checkpoint, when pipeline parallelism is set, the checkpoint cannot be saved successfully. It seems that not all checkpoint shards are saved to memory; the log reports: "Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent."
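That skip message suggests the agent only persists a step when every local shard has cached the same iteration in shared memory; with pipeline parallelism, a stage that has not yet written step 60 makes the cached steps inconsistent, so the persist is skipped. A toy illustration of that kind of check (names are hypothetical, not DLRover's implementation):

# Toy consistency check (hypothetical names): persist only if all shards on the
# node cached the same training step in shared memory.
def should_persist(cached_steps: dict, target_step: int) -> bool:
    # cached_steps maps local rank -> step currently held in its shared-memory shard
    return set(cached_steps.values()) == {target_step}

# With 8 local shards per node, one lagging pipeline stage still holding step 40
# is enough to skip persisting step 60.
print(should_persist({r: 60 for r in range(8)}, 60))              # True
print(should_persist({**{r: 60 for r in range(7)}, 7: 40}, 60))   # False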
Save log: