Orion-Zheng opened this issue 7 months ago
@Orion-Zheng, this is expected because universal checkpointing requires some metadata to be saved by the client in the checkpoint. At this time, we have only modified the Megatron-DeepSpeed client to save the required metadata. Similar changes need to be applied to the HF Trainer checkpoint-saving logic. If you have bandwidth to work on this, I think it would have a great impact by enabling universal checkpointing for HF training.
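Roughly, the save side needs to pass that metadata through the `client_state` argument of DeepSpeed's `save_checkpoint`. A minimal sketch (illustrative only; the metadata contents and the tag here are assumptions, and the actual set of fields Megatron-DeepSpeed records is richer than this):

```python
# Minimal sketch (not the actual Megatron-DeepSpeed code): persist
# universal-checkpoint metadata via DeepSpeed's client_state mechanism.
client_state = {
    'universal_checkpoint_info': {
        'universal_checkpoint_version': 0.2,
    }
}

# model_engine is the engine returned by deepspeed.initialize()
model_engine.save_checkpoint('./checkpoints',
                             tag='global_step100',
                             client_state=client_state)
```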
Thank you. I also think this would be very impactful work, because so many people use the Hugging Face Trainer now. 😃 After this month I think I will have some bandwidth to do this. I am familiar with the Trainer's save logic, but currently not very familiar with DeepSpeed's and Megatron's. I will try to read the code myself first and ask you if I still encounter barriers.
I've encountered the same error for a checkpoint saved with PyTorch Lightning + DeepSpeed, so this `ds_to_universal.py` script doesn't support PyTorch Lightning either?
Hi @tjruwase, I tried adding the `UNIVERSAL_CHECKPOINT_INFO` to the `client_state`, and `ds_to_universal.py` works fine:

```python
{
    'universal_checkpoint_info': {
        'universal_checkpoint_version': 0.2
    }
}
```
Then how do I load this universal folder into the model? I find that when using Megatron-DeepSpeed there is a flag called `universal-checkpoint`, and the only usage of it I've found in Megatron-DeepSpeed is to set `ds_config_dict["checkpoint"] = {"load_universal": True}`. However, I'm still confused about how to load the `universal_checkpoint_folder`; my current guess is the sketch below. Any hint or instruction is welcome! Thank you for your attention, and looking forward to your reply!
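Here is that guess (a sketch only; `model_engine` is the engine returned by `deepspeed.initialize()`, the path is a placeholder, and I don't know whether this is sufficient):

```python
# Guess/sketch: enable universal loading in the DeepSpeed config, the same
# way Megatron-DeepSpeed's --universal-checkpoint flag appears to do it...
ds_config_dict["checkpoint"] = {"load_universal": True}

# ...then point load_checkpoint at the converted folder?
model_engine.load_checkpoint('path/to/universal_checkpoint_folder')
```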
@Orion-Zheng Could you provide the scripts you used for training? I would be happy to help solve the issue.
I think the point is not the training scripts: `ds_to_universal.py` checks whether there is a `universal_checkpoint_info` key in the checkpoint. Knowing this, I fooled the script by adding the key, with essentially nothing in it, to the checkpoint, and it worked, as my comment above describes.
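Concretely, the patch I applied is roughly this (a sketch assuming the usual DeepSpeed ZeRO layout, where the client state lives in the `mp_rank_*_model_states.pt` files; the path is a placeholder and the version value is copied from my comment above):

```python
import torch

# Inject a near-empty universal_checkpoint_info key into an existing
# DeepSpeed checkpoint so that ds_to_universal.py accepts it.
path = 'checkpoints/global_step100/mp_rank_00_model_states.pt'
state = torch.load(path, map_location='cpu')
state['universal_checkpoint_info'] = {'universal_checkpoint_version': 0.2}
torch.save(state, path)
```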
However, without checking the Megatron-DeepSpeed repo (my environment installation failed), I don't know what the exact value of the `universal_checkpoint_info` key should be, or whether this "foolish" workaround could affect performance. And more importantly, I've now got a converted directory that I don't know how to load :(
@Orion-Zheng This PR should fix the issue you mentioned (universal checkpointing not supporting the HF Trainer). Feel free to ping me if you have any questions or suggestions on this PR.
Wow, great! I will try it later and get back to you. 😃 Many thanks for your work!
Hello @xylian86, I was previously using the HF Trainer. Why doesn't the universal checkpoint support the HF Trainer? Is there any way to load the universal checkpoint? Do I have to switch training frameworks to DeepSpeed?
Edit: I am using the HF lr scheduler + DS optimizer for training. I've managed to load the universal checkpoint by forcing `load_universal_checkpoint` to return `True`, but the training loop exits silently after the first iteration.
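For reference, the forcing hack looks like this (clearly not the intended path; it just bypasses the config check on the engine):

```python
from deepspeed.runtime.engine import DeepSpeedEngine

# Hack/sketch: make the engine always treat the checkpoint as universal,
# regardless of what the DeepSpeed config says.
DeepSpeedEngine.load_universal_checkpoint = lambda self: True
```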
I trained a model using Accelerate + DeepSpeed ZeRO-2 and got a ZeRO-2 checkpoint. The checkpoint structure is listed below, and this is the Google Drive link to my checkpoint.
I tried to convert this ZeRO-2 checkpoint to the universal format using `ds_to_universal.py` but encountered errors. It seems the checkpoint structure is a bit different from the Universal Checkpoint examples in Megatron-DeepSpeed.
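The conversion command I ran was along these lines (paths and step number are placeholders):

```bash
python ds_to_universal.py \
    --input_folder  checkpoint/global_step2000 \
    --output_folder checkpoint/global_step2000_universal
```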
May I ask how I can find the `universal_checkpoint_info` in my checkpoint?
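For what it's worth, this is how I've been inspecting the checkpoint (a sketch; the filename follows the usual ZeRO-2 layout and the step number is a placeholder):

```python
import torch

# List the top-level keys of the model-states file to check whether
# a universal_checkpoint_info entry is present.
state = torch.load('checkpoint/global_step2000/mp_rank_00_model_states.pt',
                   map_location='cpu')
print(sorted(state.keys()))
```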