huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

How to save the optimizer state while enabling Deepspeed to save the model #3190

Closed. ITerydh closed this issue 3 weeks ago.

ITerydh commented 1 month ago

System Info

Not related to any specific configuration.


Reproduction

unwrapped_model = accelerator.unwrap_model(transformer)
unwrapped_model.save_pretrained(
    save_directory,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(transformer),
)

I am using DeepSpeed ZeRO-2. I want to save both the model state and the optimizer state, but save_pretrained() only saves the model weights. How can I save the optimizer state as well?
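For context: besides save_pretrained(), Accelerate exposes a combined checkpointing API, accelerator.save_state() / accelerator.load_state(), which writes the model, optimizer, scheduler, and RNG states together; when a DeepSpeed plugin is active it delegates to DeepSpeed's own checkpoint engine, so the sharded ZeRO-2 optimizer state is handled for you. A minimal sketch (the toy model and directory name are illustrative, not from the issue):

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # DeepSpeed settings come from the accelerate config/launcher

model = torch.nn.Linear(8, 8)  # stand-in for the real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# Saves everything that went through prepare() (here: model and optimizer)
# plus RNG states in one call. With DeepSpeed enabled this delegates to the
# engine's save_checkpoint(), so the sharded ZeRO-2 optimizer state is
# written correctly on every rank.
accelerator.save_state("checkpoints/step-1000")

# To resume later:
accelerator.load_state("checkpoints/step-1000")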

Expected behavior

I would like to know whether saving the optimizer state is supported, and if so, how to do it.

THANKS!

jubueche commented 3 weeks ago
def _save_checkpoint(self, model, trial, metrics=None):
    # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
    # want to save except FullyShardedDDP.
    # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

    # Save model checkpoint
    checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

    if self.hp_search_backend is None and trial is None:
        self.store_flos()

    run_dir = self._get_output_dir(trial=trial)
    output_dir = os.path.join(run_dir, checkpoint_folder)
    self.save_model(output_dir, _internal_call=True)

    if not self.args.save_only_model:
        # Save optimizer and scheduler
        self._save_optimizer_and_scheduler(output_dir)
        # Save RNG state
        self._save_rng_state(output_dir)

This is from transformers' trainer.py, so you could have a look at self._save_optimizer_and_scheduler(output_dir) and self._save_rng_state(output_dir)
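For a script that does not go through Trainer, here is a rough sketch of what those two branches boil down to (the helper name and arguments are illustrative; the real implementation in transformers handles more backends and edge cases):

import os
import torch

def save_optimizer_and_scheduler(model, optimizer, scheduler, output_dir,
                                 is_deepspeed, is_main_process):
    # Illustrative helper, loosely mirroring Trainer._save_optimizer_and_scheduler.
    if is_deepspeed:
        # Under DeepSpeed, `model` is the DeepSpeedEngine; save_checkpoint()
        # must run on every rank because ZeRO-2 shards the optimizer state.
        model.save_checkpoint(output_dir)
    elif is_main_process:
        # Plain single-backend case: the full state dicts live on rank 0.
        torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
        torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))

The key point for ZeRO-2 is that optimizer.state_dict() on a single rank is incomplete, since the state is sharded; that is why the DeepSpeed branch checkpoints on all ranks instead of only the main process.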

ITerydh commented 3 weeks ago

This is from transformers' trainer.py, so you could have a look at self._save_optimizer_and_scheduler(output_dir) and self._save_rng_state(output_dir)

Where is this trainer.py? I can't find it. Thanks!
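(For reference, the excerpt above matches Trainer._save_checkpoint from the transformers library; the file is src/transformers/trainer.py in the transformers repository, not in accelerate.)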