microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG]Zero stage3 can not save model weights correctly! #3841

Open tarrett opened 1 year ago

tarrett commented 1 year ago

Describe the bug: Training the llama-7b model with ZeRO stage 3 and stage3_gather_16bit_weights_on_model_save set to true in ds_config.json, the saved pytorch_model.bin is only 610K. Strangely, the model saved in the checkpoint during training is normal.

The deepspeed version is 0.9.6
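For reference, the relevant portion of a ZeRO stage 3 `ds_config.json` as described above might look like the sketch below (the `bf16` section is illustrative; only the `stage` and the gather flag come from the report):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  }
}
```

With the gather flag enabled, DeepSpeed is expected to consolidate the stage-3 partitioned 16-bit weights onto rank 0 at save time, so a near-empty weights file suggests the gather did not happen.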

ajinkya123-robo commented 1 year ago

Hi, have you managed to solve it? I am facing the same issue. Thanks.

tjruwase commented 1 year ago

@tarrett, thanks for reporting this. Can you please share repro steps?

ZubinGou commented 1 year ago

Same issue: with DeepSpeed ZeRO Stage 3 + the Transformers Trainer, the final model weights are not saved correctly by trainer.save_model() after training, although checkpoints saved during training are complete.

tjruwase commented 1 year ago

@ZubinGou, can you please share details to help us repro? Thanks!

ZubinGou commented 1 year ago

Sure. Simply use the official ZeRO Stage 3 config with stage3_gather_16bit_weights_on_model_save set to true, following this. Then use the Hugging Face Trainer to train GPT-2 or LLaMA (or any model) with trainer.train() and save it with trainer.save_model(); the saved weights will still be incomplete.
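Since the reported symptom is a weights file of only a few hundred kilobytes instead of several gigabytes, a quick guard after saving can catch the bug early. This is a minimal stdlib sketch; `looks_truncated` is a hypothetical helper (not part of DeepSpeed or Transformers), and the 100 MB threshold is an assumption tuned for billion-parameter models:

```python
import os

def looks_truncated(weights_path: str, min_bytes: int = 100 * 1024 * 1024) -> bool:
    """Return True if the saved weights file is suspiciously small,
    e.g. the ~610K pytorch_model.bin reported in this issue."""
    return os.path.getsize(weights_path) < min_bytes

# Example usage after trainer.save_model(output_dir):
#   path = os.path.join(output_dir, "pytorch_model.bin")
#   if looks_truncated(path):
#       raise RuntimeError(f"{path} looks truncated; ZeRO-3 gather may have failed")
```

As a workaround, DeepSpeed also ships a `zero_to_fp32.py` script inside each ZeRO checkpoint directory that can consolidate the partitioned shards into full fp32 weights offline, which is why the in-training checkpoints remain recoverable even when the final save is broken.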

You can use any of the following repositories to reproduce this issue:

Mr-lonely0 commented 4 months ago

Hello, have you solved this?