microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

load_checkpoint nuances #647

Open stas00 opened 3 years ago

stas00 commented 3 years ago

I have a few questions about model checkpointing: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html

I'm trying to figure out how to best integrate deepspeed into that area.

  1. If we already have code that does checkpointing of the model/optim/scheduler, so that in a simplified way we have the basics:

torch.save(self.optimizer.state_dict(), d)
torch.save(self.lr_scheduler.state_dict(), d)
torch.save(self.model.state_dict(), d)

where self.model is the "client" model. And then the same for loading.

Now when I call deepspeed.DeepSpeedEngine.save_checkpoint I get four things saved: engine/model/optim/scheduler.

When it comes to loading it back, I do deepspeed.DeepSpeedEngine.load_checkpoint - do I need to somehow update our trainer's self.scheduler and self.optimizer from that loaded object? I don't see an API to do that.

Or would it be simpler not to delegate to DS any saving other than its own engine state, and to save and restore model/optim/scheduler separately (since we are doing that anyway when the trainer is not running under DeepSpeed)?

To exemplify with code:

We start with:

model, optimizer, _, lr_scheduler = deepspeed.initialize(...)
self.deepspeed = model # DeepSpeedEngine object
self.model = model.module
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler

So the new saving code would be:

torch.save(self.optimizer.state_dict(), d)
torch.save(self.lr_scheduler.state_dict(), d)
torch.save(self.model.state_dict(), d)
if self.deepspeed:
    self.deepspeed.save_checkpoint(d)

and then on load, again leave most of our code intact and just update the engine:

self.optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")))
self.lr_scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))
self.model = self.model.from_pretrained(model_path)
if self.deepspeed:
    self.deepspeed.load_checkpoint(model_path, load_optimizer_states=False, load_lr_scheduler_states=False)

Am I wasting resources by saving/loading the separate components, since deepspeed will have to do it anyway? I'm asking because our code is spread around and we don't always load all components together; e.g. sched/optim are loaded separately, so we end up loading the model twice because deepspeed doesn't separate the components. That is, we can't tell it not to load the model (but we can skip loading the sched/optim), as the flags below illustrate.
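
To make the asymmetry concrete (this merely restates the flags already shown above):

self.deepspeed.load_checkpoint(model_path,
                               load_optimizer_states=False,    # optimizer state can be skipped
                               load_lr_scheduler_states=False) # scheduler state can be skipped
# but there is no equivalent flag for the module itself, so the model
# weights are always read, even when we have already loaded them ourselves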

Alternatively, I could just do:

if self.deepspeed:
    self.deepspeed.load_checkpoint(model_path, load_optimizer_states=True, load_lr_scheduler_states=True)
else:
    self.optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")))
    self.lr_scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))
    self.model = self.model.from_pretrained(model_path)

and if this is done, do we get all the previously assigned variables, e.g. the self.optimizer we assigned at the beginning from deepspeed.initialize, updated to the loaded-from-the-checkpoint values - or do we now somehow have to recreate all those variables?

model, optimizer, _, lr_scheduler = self.deepspeed.somehow_get_each_component_again
self.deepspeed = model 
self.model = model.module
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler

I hope my question is easy to understand.

If I were to ask it in a different way: what happens on deepspeed.load_checkpoint, where do things go, and what needs to be done besides loading the checkpoint? An example would be very helpful.


  2. And one more question: we have code that checks whether the saved model dir has saved optim/sched:
            and os.path.isfile(os.path.join(model_path, "optimizer.pt"))
            and os.path.isfile(os.path.join(model_path, "scheduler.pt"))

    and loads them before training. How would you approach that for deepspeed - which filesystem pattern should we match to identify that there is a saved DeepSpeed checkpoint that can be loaded?

I typically see a global_step0 folder. Is it always the same, or perhaps you have a discovery function, so that we could do something like:

if deepspeed.has_checkpoint(path):
    deepspeed.load_checkpoint(path)

I suppose we could try/except too, but that's not very clean if there is (or could be) an API for this.
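
For the record, a sketch of the kind of heuristic I mean - ds_checkpoint_exists is a hypothetical helper, and it assumes save_checkpoint writes a "latest" tag file next to the global_step* folders:

import os

def ds_checkpoint_exists(path):
    # hypothetical helper: assumes save_checkpoint writes a "latest" file
    # holding the most recent tag (e.g. "global_step0") in the save dir
    return os.path.isfile(os.path.join(path, "latest"))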

And thinking more about it: since deepspeed.load_checkpoint will return (None, ?) if nothing is found at path - will this invalidate the existing deepspeed object?
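
In other words, given that load_checkpoint is documented to return a (load_path, client_state) tuple, I'd hope this pattern is safe:

load_path, client_state = self.deepspeed.load_checkpoint(model_path)
if load_path is None:
    # nothing was found - does the engine keep its freshly initialized
    # state here, or is it left in a broken/partial state?
    pass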

Thank you.

tjruwase commented 3 years ago

Let me try to understand the first question. If the optimizer and lr_scheduler are returned by deepspeed.initialize(...), then save_checkpoint() and load_checkpoint() should work to restore both components. In particular, your snippet below should just work, restoring the underlying optimizer and lr_scheduler variables appropriately without any need for extra copying.

if self.deepspeed:
    self.deepspeed.load_checkpoint(model_path, load_optimizer_states=True, load_lr_scheduler_states=True)
else:
    self.optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")))
    self.lr_scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))
    self.model = self.model.from_pretrained(model_path)
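
The reason this works: initialize hands back the engine's own objects, so load_checkpoint restores their state in place. Roughly (the engine does keep these as attributes internally; treat the asserts as illustration, not as a guaranteed public API):

model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(...)
# the returned optimizer/scheduler are the same objects the engine holds,
# so restoring the engine restores them too - no re-fetching or copying
assert optimizer is model_engine.optimizer
assert lr_scheduler is model_engine.lr_scheduler
model_engine.load_checkpoint(model_path)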

Does this address your first question? Or are you experiencing something different with the code?

stas00 commented 3 years ago

Yes, thank you!

The problem is that such APIs are ambiguous. If deepspeed.initialize returns separate variables:

deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(...)

how can a user guess that calling:

deepspeed_engine.load_checkpoint(...)

will affect the optimizer and lr_scheduler variables? Does my quandary make more sense now?

If the API were to be for example:

deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(...)
optimizer, lr_scheduler = deepspeed_engine.optimizer, deepspeed_engine.lr_scheduler

then a user knows that those are parts of the engine and not separate entities, and so updating the engine will update its attributes as well.

tjruwase commented 3 years ago

This is so spot on. I completely agree. I feel the API would look like that if we could do it all over again. In fact, I would tweak your proposal ever so slightly:

deepspeed_engine = deepspeed.initialize(...)
assert deepspeed_engine.has_module() and deepspeed_engine.has_scheduler() and deepspeed_engine.has_optimizer()
model = deepspeed_engine.module 
scheduler = deepspeed_engine.scheduler
optimizer = deepspeed_engine.optimizer

Do you think this could help with the confusion? We will discuss if we can move towards this.

stas00 commented 3 years ago

Yes, your proposed version is perfect, @tjruwase.

stas00 commented 3 years ago

My only gripe in general with most pytorch implementations is the model/module duality. While I understand that module implies a subclass of nn.Module, whereas a model doesn't have to be one as long as it implements __call__, I still find this somewhat vexing at times. Especially since some wrappers use .model and others .module to access the thing they wrapped - I guess it'd have been different if it had been nn.Model in the first place.

While working on integrating DS, we changed the trainer to have a clear self.model_wrapped pointing to the outermost model, if any (e.g. DDP(Deepspeed(Transformer))), self.model for the actual transformers model, and self.deepspeed for the deepspeed engine, so at the point of use one doesn't need to read code to understand which of the 3 models/modules to use. A rough sketch follows.
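
Schematically (simplified - the attribute names are the ones our trainer uses, and whether DDP wraps the engine depends on the setup):

engine, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, ...)
self.deepspeed = engine        # the DeepSpeedEngine itself
self.model = engine.module     # the actual transformers model
self.model_wrapped = engine    # outermost wrapper; DDP(engine) when DDP is used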

tjruwase commented 3 years ago

@stas00 I totally get you. My background is PL (programming languages), where elegance, clarity, and cleanliness were required in programming constructs :).

stas00 commented 3 years ago

Is this on the DeepSpeed TODO list, and is it OK to close this?

tjruwase commented 3 years ago

@stas00, sorry, this needs to remain open for a bit longer in order to track it. I haven't yet had a chance to work on it.

Moldoteck commented 3 years ago

Is there a way to get the param_groups property of the optimizer if the optimizer was created from the JSON config? (Sketch below of what I mean.)
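
A sketch of the situation being asked about - ds_config stands in for the JSON config that defines the optimizer, and whether the wrapper that initialize returns forwards param_groups may depend on the ZeRO stage, so treat this as the question rather than the answer:

# the optimizer is defined only in the JSON config, not constructed in code
model_engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                     model_parameters=model.parameters(),
                                                     config_params=ds_config)
# can the underlying param groups be reached like this?
for group in optimizer.param_groups:
    print(group["lr"])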

omerXfaruq commented 2 years ago

> @stas00, sorry, this needs to remain open for a bit longer in order to track it. I haven't yet had a chance to work on it.

What about now? @tjruwase :D

tjruwase commented 2 years ago

@FarukOzderim, thanks for bringing attention to this again. I must confess that it has fallen off my radar. Can you please explain your interest in this issue and your desired outcome? Thanks

omerXfaruq commented 2 years ago

I have no desire or interest; I just saw it was still hanging after a year and thought it could be beneficial to mention it :)

tjruwase commented 2 years ago

Fair enough :). Unfortunately, we are still chugging along with the counter-intuitive API for now. If you have the bandwidth to contribute a PR on this, it would be greatly appreciated.

OsebeSammi commented 1 year ago

From this documentation https://deepspeed.readthedocs.io/en/latest/initialize.html#training-initialization, it looks like it is possible to pass an optimiser separately when initializing for training:

model, optimizer, _, _ = deepspeed.initialize(model=model,
                                              config_params=utils.get_deepspeed_config(),
                                              optimizer=checkpoint_optimiser,
                                              model_parameters=parameters)
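
(Per the linked docs, the optimizer argument is a user-defined optimizer that is typically used instead of defining one in the DeepSpeed JSON config - so a restored checkpoint_optimiser passed this way should take the place of a config-defined one.)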