Closed MnLgt closed 2 months ago
Yes, that's normal behavior when using DeepSpeed. The `mp_rank_00_model_states.pt` file contains the model states for the first process (rank 0) in a multi-process setting. You can convert this file to a `.bin` file using PyTorch's `torch.load()` and `torch.save()` functions to get a similar checkpoint structure. You could use the following code to do this:
```python
import torch

# Load the model states from the mp_rank_00_model_states.pt file
model_states = torch.load("mp_rank_00_model_states.pt", map_location=torch.device("cpu"))

# Save the model states in .bin format
torch.save(model_states, "model_states.bin")
```
This code will load the model states from `mp_rank_00_model_states.pt` and save them to a file named `model_states.bin`. You can then use this `.bin` file for logging validation during training.
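Note that the file saved above is still the same dict DeepSpeed wrote; in DeepSpeed checkpoints the actual model weights typically sit under a `"module"` key (inspect your own checkpoint's keys to confirm, as this is an assumption). A minimal, self-contained sketch of restoring those weights into a model for validation, using a hypothetical tiny `nn.Linear` as a stand-in:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real one.
model = nn.Linear(4, 2)

# Simulate a DeepSpeed-style checkpoint dict: weights nested under a
# "module" key (assumption -- check the keys of your own checkpoint).
checkpoint = {"module": model.state_dict()}
torch.save(checkpoint, "model_states.bin")

# Later, for validation: load the .bin and restore the state dict.
loaded = torch.load("model_states.bin", map_location="cpu")
state_dict = loaded.get("module", loaded)  # fall back to the raw dict
model.load_state_dict(state_dict)
```

If `load_state_dict` complains about unexpected keys, printing `loaded.keys()` is usually enough to see how the checkpoint is nested.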
Ah, thanks for the explanation!
You're welcome
Be sure to close the issue in order to reduce traffic.
Hi,
When I save a checkpoint during training I've been using `accelerator.save_state`, which saves a `model.safetensors` or a `pytorch_model.bin`, depending on the `safe_serialization` setting.
However, I'm trying to implement DeepSpeed, and the checkpoint-saving behavior is different: instead of a `model.safetensors` or `pytorch_model.bin`, it saves `mp_rank_00_model_states.pt`. Is this normal and, if so, is there a way to convert this to a `.bin` file to use for logging validation during training?