According to /opt/conda/envs/dlrover/lib/python3.10/site-packages/dlrover/trainer/torch/flash_checkpoint/deepspeed_engine.py
def save_to_storage(self, step, state_dict, paths):
    """
    Asynchronously saves the state dict into the storage. It synchronously
    saves the state dict into the shared memory and puts the path
    into a shared queue. The agent in the main process waits on the queue
    to save the state dict from the shared memory into the storage.
    Only rank 0 saves the state dict into the storage.

    Args:
        step (int): the global iteration step.
        state_dict (dict): the state dict of model and optimizer to save.
        paths (dict): the key is a category in
            ["model_states", "optim_states"] of the state dict and
            the value is the path of storage to save.
    """
    success = True
    if step > self._cached_step:
        success = self.save_to_memory(step, state_dict, paths)
    if dist.is_initialized():
        dist.barrier()
    # Only local rank 0 notifies the agent of the saving event.
    if self._local_rank == 0 and success:
        event = CheckpointEvent(type=CheckpointEventType.SAVE, step=step)
        self._event_queue.put(event)
    return success
The check self._local_rank == 0 means the checkpoint is only saved on rank 0, but I am using DeepSpeed ZeRO-3, where the model state dict is partitioned across multiple GPUs. With ZeRO-3 the model is sharded onto multiple cards, so when the checkpoint is saved, is only the shard held by rank 0 saved asynchronously? Thanks for helping to clarify!
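For context, here is a minimal sketch (not DLRover's flash-checkpoint code path) of how a ZeRO-3 checkpoint is normally written with plain DeepSpeed. It assumes model_engine is an already-initialized deepspeed.DeepSpeedEngine running ZeRO stage 3, and ckpt_dir / tag are placeholder names. The point it illustrates is that save_checkpoint is a collective call: every data-parallel rank writes its own shard, rather than rank 0 writing everything.

    # Minimal sketch, NOT DLRover's save_to_storage: with ZeRO stage 3,
    # DeepSpeed's own save_checkpoint must be called on every rank because
    # each rank only holds its own parameter/optimizer partition.
    import deepspeed  # model_engine is assumed to come from deepspeed.initialize()

    def save_zero3_checkpoint(model_engine, ckpt_dir, tag):
        # Collective call: each rank writes its own shard file
        # (e.g. zero_pp_rank_<N>_mp_rank_00_optim_states.pt); calling it
        # on rank 0 alone would hang or drop the other ranks' shards.
        model_engine.save_checkpoint(ckpt_dir, tag=tag)

So the question is whether DLRover's asynchronous path preserves the same per-rank sharding, or whether only the rank-0 shard reaches storage.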