Open tarrett opened 1 year ago
Hi, have you managed to solve it? I am also facing the same issue. Thanks.
@tarrett, thanks for reporting this. Can you please share repro steps?
Same issue: with DeepSpeed ZeRO Stage 3 + Transformers Trainer, we can't correctly save the final model weights after training with `trainer.save_model()`. However, the checkpoints saved during training are fine.
@ZubinGou, can you please share details to help us repro? Thanks!
Sure. Simply use the official ZeRO Stage 3 config with `stage3_gather_16bit_weights_on_model_save` set to `true`, following this. Then use the Hugging Face Trainer to train GPT-2 or LLaMA (or any model) with `trainer.train()` and save the model with `trainer.save_model()`; you will find the saved weights are still incomplete.
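For reference, a minimal sketch of the relevant portion of such a `ds_config.json` (the ZeRO field names are from the DeepSpeed docs; the surrounding values are illustrative and `"auto"` relies on the Transformers integration filling them in):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true },
  "train_batch_size": "auto"
}
```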
You can use any of the following repositories to reproduce this issue:
Hello, have you solved this?
**Describe the bug**
Training the llama-7b model with ZeRO Stage 3 and `stage3_gather_16bit_weights_on_model_save` set to `true` in `ds_config.json`, but the size of the saved `pytorch_model.bin` is only 610K. Strangely, the model saved in the checkpoint directory is normal.
The DeepSpeed version is 0.9.6.
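One way to catch this failure mode early in a training script (a hedged sketch; the helper name and size threshold are mine, not part of DeepSpeed or Transformers): a 7B-parameter model serializes to many gigabytes, so a file of a few hundred kilobytes, like the 610K one reported above, means the ZeRO-3 partitioned weights were never gathered before saving.

```python
import os

def looks_incomplete(path: str, min_bytes: int = 1_000_000) -> bool:
    """Heuristic check on a saved state-dict file: any real LLM
    checkpoint is far larger than 1 MB, so a tiny file indicates that
    only placeholders/metadata were written, not the gathered weights."""
    return os.path.getsize(path) < min_bytes
```

As a workaround, the `zero_to_fp32.py` script that DeepSpeed writes into each checkpoint directory can consolidate the partitioned checkpoint into a full fp32 state dict offline.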