huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add an option to decide whether to store the checkpoint and rng_state. #26706

timturing opened this issue 11 months ago

timturing commented 11 months ago

Motivation: Currently, when using the Transformers library together with DeepSpeed to train large language models (LLMs), checkpoints (e.g. bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt) are automatically saved along with the rng_state, which can lead to significant disk usage. When training on multiple GPUs, this quickly becomes a storage bottleneck, especially on disks shared by a team. Sometimes we only want to keep the model weight files (e.g. pytorch_model-00001-of-00002.bin), which are enough to load the model again.

Feature Request: I propose adding a configurable option to decide whether to store the checkpoint and rng_state during training. This will give users the flexibility to choose when to save checkpoints and reduce the disk space required.

Proposed Solution:

  1. Add a new parameter, such as save_checkpoint_enabled, to the DeepSpeed configuration file. Users can set this parameter to True or False to control whether checkpoints and rng_state are saved during training (see the sketch after the proposal below).

  2. Modify the trainer.py script in the Transformers library to include a check on self.save_checkpoint_enabled in the _save_checkpoint function. Here's a code snippet illustrating the change:

    if self.is_deepspeed_enabled and self.save_checkpoint_enabled:
        # only write the DeepSpeed checkpoint (optimizer and rng states) when the user has enabled it
        self.model_wrapped.save_checkpoint(output_dir)

This change will allow users to save disk space by not storing checkpoints when not needed, and it can help alleviate the storage challenges associated with large-scale language model training.
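A rough sketch of how the proposed flag might be wired in from the user side; note that save_checkpoint_enabled is hypothetical and does not exist in DeepSpeed or Transformers today, while passing the DeepSpeed config to TrainingArguments as a dict is existing behaviour:

    from transformers import TrainingArguments

    # Hypothetical DeepSpeed config: only "save_checkpoint_enabled" is proposed here,
    # the rest is an ordinary bf16 + ZeRO-2 setup.
    ds_config = {
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
        "save_checkpoint_enabled": False,  # proposed: skip *_optim_states.pt and rng_state files
    }

    # TrainingArguments accepts either a path to a DeepSpeed JSON file or a dict.
    args = TrainingArguments(output_dir="out", deepspeed=ds_config)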

I have already submitted this issue to the DeepSpeed library (https://github.com/microsoft/DeepSpeed/issues/4403#issue-1913025248), as this feature may require collaboration between the two libraries.

timturing commented 11 months ago

The issue I reported is still impacting my work, as our group is building a fairly large project on top of this, and I believe it is an important one to address. I would be grateful for your help.

ArthurZucker commented 11 months ago

cc @pacman100 and @muellerzr if you think this is something we ought to have!

StevenTang1998 commented 10 months ago

@pacman100 @muellerzr I ran into the same situation. Could you provide an option to save disk space?

pacman100 commented 10 months ago

Hello @timturing, checkpoints during training are meant for resuming it and therefore save the model, optimizer, scheduler and rng states. What you want is to save just the model, without the ability to resume training. Is that understanding correct?

timturing commented 10 months ago

@pacman100 Yes, exactly.

StevenTang1998 commented 10 months ago

Something like the save_strategy option in the Trainer (https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_strategy). Since SFT is very mature, we do not need to keep intermediate checkpoints for resuming.
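For reference, a workaround along these lines already works today (a minimal sketch; model and train_dataset are assumed to be defined elsewhere):

    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="out",
        save_strategy="no",  # no intermediate checkpoints (optimizer, scheduler, rng states)
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model("out/final")  # writes only the model weights (e.g. pytorch_model-*.bin)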

StevenTang1998 commented 10 months ago

@pacman100 @muellerzr Hi, could you look into this? It would be very useful for me.