Open woshiyyya opened 1 year ago
When training a 70B model, each shard writes a very large binary file to disk, and syncing those files from local storage to S3 always times out. It would be a great benefit to have a parameter that can be passed in to save the model with weights only.
Currently, in ZeRO-3, we are saving model params and optimizer states together in `.*_optim_state.pt` files. Saving optimizer states can greatly increase the checkpoint size, while we don't actually need them for inference. In order to extract the model weights, we need to load all checkpoint shards into memory and then extract the weights, which requires a huge amount of RAM.
Therefore, it would be great to enable only saving model weights in checkpoints.
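For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes mixed-precision Adam, with 2 bytes per parameter for fp16/bf16 weights and 12 bytes per parameter for optimizer state (fp32 master weights, momentum, and variance). The helper `checkpoint_sizes_gb` and these byte counts are illustrative assumptions, not values read from an actual checkpoint, and the real on-disk layout may differ.

```python
# Back-of-the-envelope checkpoint size estimate (illustrative assumptions only).
GB = 1e9  # decimal gigabytes


def checkpoint_sizes_gb(num_params, weight_bytes=2, optim_bytes=12):
    """Return (weights_gb, optimizer_gb) for a model with num_params parameters.

    weight_bytes: assumed bytes per parameter for half-precision weights.
    optim_bytes:  assumed bytes per parameter for Adam state in mixed precision
                  (fp32 master copy + momentum + variance = 4 + 4 + 4).
    """
    weights_gb = num_params * weight_bytes / GB
    optimizer_gb = num_params * optim_bytes / GB
    return weights_gb, optimizer_gb


weights_gb, optimizer_gb = checkpoint_sizes_gb(70e9)
print(f"weights only:     {weights_gb:.0f} GB")
print(f"optimizer states: {optimizer_gb:.0f} GB")
print(f"full checkpoint:  {weights_gb + optimizer_gb:.0f} GB")
```

Under these assumptions the optimizer states account for roughly six sevenths of the total, so a weights-only option would shrink both the files synced to S3 and the RAM needed to load them.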