timturing opened this issue 11 months ago
The issue I reported is still impacting my work, as our group is building a fairly large project on top of this library, and I believe it is an important one to address. I would be grateful if you could help me.
cc @pacman100 and @muellerzr if you think this is something we ought to have!
@pacman100 @muellerzr I have also run into the same situation. Could you provide an option to save disk space?
Hello @timturing, checkpoints during training are meant for resuming it, and therefore save the model, optimizer, scheduler, and RNG states. What you want is to save just the model, without the ability to resume training. Is that understanding correct?
@pacman100 Yes, exactly.
Just like the `save_strategy` option in the Trainer (https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_strategy). Since SFT is very mature, we do not need to save the intermediate states for resuming.
@pacman100 @muellerzr Hi, could you improve this? This is very useful for me.
Motivation: Currently, when using the Transformers library together with DeepSpeed to train large language models, checkpoints (e.g. `bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt`) are automatically saved along with the `rng_state`, which can lead to significant disk-space usage. When multiple GPUs are used for training, this quickly becomes a storage bottleneck, especially on disks shared by a team. Often we only want to keep the model weight files (e.g. `pytorch_model-00001-of-00002.bin`), which are enough to load the model again.

Feature Request: I propose adding a configurable option that decides whether the full checkpoint and `rng_state` are stored during training. This would give users the flexibility to choose when to save checkpoints and reduce the disk space required.

Proposed Solution:
- Add a new parameter, such as `save_checkpoint_enabled`, to the DeepSpeed configuration file. Users can set this parameter to `True` or `False` to control whether checkpoints and `rng_state` are saved during training.
- Modify the `trainer.py` script in the Transformers library to include a condition on `self.save_checkpoint_enabled` in the `_save_checkpoint` function.

This change would allow users to save disk space by not storing full checkpoints when they are not needed, and it would help alleviate the storage challenges associated with large-scale language-model training.
I have already submitted this issue to the DeepSpeed library (https://github.com/microsoft/DeepSpeed/issues/4403), as this feature may require collaboration between both libraries.