microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Saving model weights only in checkpoints #3820

Open woshiyyya opened 1 year ago

woshiyyya commented 1 year ago

Currently, in ZeRO-3, we save model params and optimizer states together in `*_optim_states.pt` files. Saving optimizer states can greatly increase the checkpoint size, yet we don't actually need them for inference.

In order to extract the model weights, we need to load all checkpoint shards into memory and then strip out everything but the weights, which requires a huge amount of RAM.
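
For context, this is roughly the extraction path available today via the `zero_to_fp32` utilities bundled with DeepSpeed; a minimal sketch, assuming a standard ZeRO-3 checkpoint layout (paths are placeholders):

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidating the shards reads every *_optim_states.pt file, so the full
# parameters plus the partitioned optimizer state pass through CPU RAM even
# though only the weights are wanted. "checkpoints" is a placeholder path;
# the tag is resolved from the `latest` file DeepSpeed writes at save time.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints")

# Only after full consolidation can a weights-only file be written.
torch.save(state_dict, "pytorch_model.bin")
```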

Therefore, it would be great to support saving only the model weights in checkpoints.
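
As a partial workaround today, ZeRO-3 can already write a weights-only 16-bit copy at save time via `save_16bit_model`, provided weight gathering is enabled in the config; a minimal sketch, assuming `engine` is the object returned by `deepspeed.initialize(...)`:

```python
# Requires the ZeRO config to allow gathering the partitioned weights on save:
#   "zero_optimization": {
#       "stage": 3,
#       "stage3_gather_16bit_weights_on_model_save": true
#   }
# Writes only the fp16/bf16 weights (no optimizer states) to
# outputs/pytorch_model.bin; both the path and filename are placeholders.
engine.save_16bit_model("outputs", "pytorch_model.bin")
```

This covers inference use cases but not a weights-only fp32 checkpoint, so a first-class option in `save_checkpoint` would still be valuable.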

robinsonmhj commented 6 months ago

When training a 70b model, each shard generates a very large binary file on disk, and syncing those files from local storage to S3 always times out. It would be a great benefit if there were a parameter that could be passed in to save the model weights only.
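
Until such a parameter exists, one way to cut what has to cross the network may be to consolidate the checkpoint into a single weights-only file before the sync, using the same `zero_to_fp32` helpers mentioned above; a sketch with placeholder paths:

```python
from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

# Folds all shards into one optimizer-free fp32 state dict on disk, so only
# "pytorch_model.bin" needs to be uploaded. Both paths are placeholders.
convert_zero_checkpoint_to_fp32_state_dict("checkpoints", "pytorch_model.bin")
```

Note this still pays the consolidation RAM cost described above; it only shrinks the upload, not the peak memory.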