Closed MnLgt closed 2 months ago
Yes, that's normal behavior when using DeepSpeed. The `mp_rank_00_model_states.pt` file contains the model states for the first process (rank 0) in a multi-process setting. You can convert this file to a `.bin` file using PyTorch's `torch.load()` and `torch.save()` functions to get a similar checkpoint structure. You could use the following code to do this:
```python
import torch

# Load the model states from the mp_rank_00_model_states.pt file
model_states = torch.load("mp_rank_00_model_states.pt", map_location=torch.device("cpu"))

# Save the model states in .bin format
torch.save(model_states, "model_states.bin")
```
This code will load the model states from `mp_rank_00_model_states.pt` and save them to a file named `model_states.bin`. You can then use this `.bin` file for logging validation during training.
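Note that the file saved above is still the same dict DeepSpeed wrote; in DeepSpeed checkpoints the actual model weights typically sit under a `"module"` key (inspect your own checkpoint's keys to confirm, as this is an assumption). A minimal, self-contained sketch of restoring those weights into a model for validation, using a hypothetical tiny `nn.Linear` as a stand-in:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real one.
model = nn.Linear(4, 2)

# Simulate a DeepSpeed-style checkpoint dict: weights nested under a
# "module" key (assumption -- check the keys of your own checkpoint).
checkpoint = {"module": model.state_dict()}
torch.save(checkpoint, "model_states.bin")

# Later, for validation: load the .bin and restore the state dict.
loaded = torch.load("model_states.bin", map_location="cpu")
state_dict = loaded.get("module", loaded)  # fall back to the raw dict
model.load_state_dict(state_dict)
```

If `load_state_dict` complains about unexpected keys, printing `loaded.keys()` is usually enough to see how the checkpoint is nested.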
Ah, thanks for the explanation!
You're welcome
Be sure to close the issue in order to reduce traffic.
Hi,
When I save a checkpoint during training I've been using `accelerator.save_state`, which saves a `model.safetensors` or a `pytorch_model.bin`, depending on the `safe_serialization` setting.
However, I'm trying to implement DeepSpeed, and the checkpoint-saving behavior is different: instead of a `model.safetensors` or `pytorch_model.bin`, it saves `mp_rank_00_model_states.pt`. Is this normal and, if so, is there a way to convert this to a `.bin` file to use for logging validation during training?