intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Does deepspeed zero3 also save the ckpt only on rank 0? #1256

Closed. Alex-Ruan closed this issue 1 month ago.

Alex-Ruan commented 2 months ago

According to /opt/conda/envs/dlrover/lib/python3.10/site-packages/dlrover/trainer/torch/flash_checkpoint/deepspeed_engine.py:

def save_to_storage(self, step, state_dict, paths):
        """
        Asynchronously saves the state dict into the storage. It synchronously
        saves the state dict into the shared memory and puts the paths
        into a shared queue. The agent in the main process waits on the queue
        to save the state dict from the shared memory into the storage.
        Only rank 0 saves the state dict into the storage.
        Args:
            step (int): the global iteration step.
            state_dict (dict): the state dict of model and optimizer to save.
            paths (dict): the key is a category in
                ["model_states", "optim_states"] of the state dict and
                the value is the path of storage to save.
        """
        success = True
        if step > self._cached_step:
            success = self.save_to_memory(step, state_dict, paths)
        if dist.is_initialized():
            dist.barrier()
        # Only local rank 0 to notify the saving event to the agent.
        if self._local_rank == 0 and success:
            event = CheckpointEvent(type=CheckpointEventType.SAVE, step=step)
            self._event_queue.put(event)
        return success

self._local_rank == 0 means it saves the ckpt only on rank 0, but I am using deepspeed zero3, where the model state dict is sharded across multiple GPUs. Since zero3 shards the model onto multiple cards, does the asynchronous save still write only rank 0's ckpt when saving? Thanks for helping to answer this!
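
For reference, a minimal sketch of how a training loop might drive this method, based only on the docstring's argument description; the engine object, paths, and training helpers below are hypothetical, not dlrover's documented API:

# Hypothetical usage sketch; `engine` is assumed to wrap a DeepSpeed
# engine with flash-checkpoint support, and the paths are illustrative.
for step in range(1, num_steps + 1):
    train_one_step()  # hypothetical per-iteration training call
    if step % save_interval == 0:
        paths = {
            "model_states": f"/mnt/ckpt/step-{step}/model_states.pt",
            "optim_states": f"/mnt/ckpt/step-{step}/optim_states.pt",
        }
        engine.save_to_storage(step, state_dict, paths)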

workingloong commented 2 months ago

No. That check only means that local rank 0 alone notifies the asynchronous agent process to export the ckpt from shared memory to disk. Every rank still writes its own shard into shared memory via save_to_memory beforehand.
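
In other words, all ranks deposit their shards before the barrier, and only the notification is restricted to local rank 0. Below is a simplified stand-in for that flow, using a toy dict and queue instead of dlrover's real shared-memory internals:

import torch.distributed as dist

def save_shard_async(step, shard_state_dict, shm_buffers, event_queue, local_rank):
    # Every rank, not just rank 0, copies its own ZeRO-3 shard into
    # shared memory (toy substitute: a per-rank slot in a dict).
    shm_buffers[local_rank] = {k: v.detach().cpu() for k, v in shard_state_dict.items()}
    # Wait until all ranks have finished copying their shards.
    if dist.is_initialized():
        dist.barrier()
    # Only local rank 0 notifies the agent process; the agent then
    # flushes every rank's buffered shard from shared memory to disk.
    if local_rank == 0:
        event_queue.put(("SAVE", step))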