Open · stas00 opened this issue 3 years ago
Let me try to understand the first question. If the optimizer and lr_scheduler are returned by `deepspeed.initialize(...)`, then `save_checkpoint()` and `load_checkpoint()` should work to restore both components. In particular, your snippet below should just work to instantiate the underlying optimizer and lr_scheduler variables appropriately, without any need for extra copying.
```python
if self.deepspeed:
    self.deepspeed.load_checkpoint(model_path, load_optimizer_states=True, load_lr_scheduler_states=True)
else:
    self.optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")...
    self.lr_scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")...
    self.model = self.model.from_pretrained(model_path)
```
Does this address your first question? Or are you experiencing something different with the code?
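To make this concrete, a minimal round-trip sketch (`ds_config`, `checkpoint_dir` and the tag name are made up for illustration, and it assumes a scheduler is configured in the json config):

```python
import deepspeed

# all three handles come from deepspeed.initialize, so the engine tracks their state
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,                          # assumed: your torch.nn.Module
    model_parameters=model.parameters(),
    config_params=ds_config,              # assumed: dict or path to the ds json config
)

# saves the engine state plus model/optimizer/scheduler under one tag
engine.save_checkpoint(checkpoint_dir, tag="step1000")

# later: restores the state of the very same optimizer/lr_scheduler objects in place,
# so nothing needs to be copied back into them manually
engine.load_checkpoint(checkpoint_dir, tag="step1000",
                       load_optimizer_states=True,
                       load_lr_scheduler_states=True)
```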
Yes, thank you!
The problem is that such APIs are ambiguous. If `deepspeed.initialize` returns 3 variables:

```python
deepspeed_engine, scheduler, optimizer = deepspeed.initialize()
```

how can a user guess that calling `deepspeed_engine.load_checkpoint()` will affect the `scheduler` and `optimizer` variables? Does my quandary make more sense now?
If the API were to be, for example:

```python
deepspeed_engine, scheduler, optimizer = deepspeed.initialize()
scheduler, optimizer = deepspeed_engine.scheduler, deepspeed_engine.optimizer
```

then a user knows that those are parts of the engine and not separate entities, and so updating the engine will update its attributes as well.
This is so spot on. I completely agree. I feel the API would be similar to that if we could do them all over again. In fact, I would tweak your proposal ever so slightly:

```python
deepspeed_engine = deepspeed.initialize(..)
assert deepspeed_engine.has_module() and deepspeed_engine.has_scheduler() and deepspeed_engine.has_optimizer()
model = deepspeed_engine.module
scheduler = deepspeed_engine.scheduler
optimizer = deepspeed_engine.optimizer
```

Do you think this could help with the confusion? We will discuss if we can move towards this.
Yes, your proposed version is perfect, @tjruwase.
My only gripe in general with most pytorch implementations is the `model`/`module` duality. While I understand that `module` implies a subclass of `nn.Module`, and `model` doesn't have to be one as long as it implements `__call__`, I still find this somewhat vexing at times. Especially since some wrappers use `.model` and others `.module` to access the thing they wrapped - I guess it'd have been different if it were `nn.Model` in the first place.
While working on integrating DS, we changed the trainer to have a clear `self.model_wrapped` pointing to the outermost model, if any (e.g. DDP(Deepspeed(Transformer))), `self.model` for the actual transformers model, and `self.deepspeed` for the deepspeed engine, so at the point of use one doesn't need to read code to understand which of the 3 model/modules to use.
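In rough pseudocode (not the actual transformers code - `ds_config` is a made-up name and the setup is simplified):

```python
# self.model stays the bare transformers model throughout
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=self.model,
    model_parameters=self.model.parameters(),
    config_params=ds_config,
)
self.deepspeed = engine        # the deepspeed engine itself
self.model_wrapped = engine    # whatever ends up outermost, e.g. DDP(Deepspeed(Transformer))
self.optimizer, self.lr_scheduler = optimizer, lr_scheduler
```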
@stas00 I totally get you. My background is PL, where elegance, clarity, and cleanliness were required in programming constructs :).
Is this on the DeepSpeed TODO list, and is it OK to close this issue?
@stas00, sorry, this needs to remain open for a bit longer in order to track it. I have not yet had a chance to work on this.
Is there a way to get the param_groups property of the optimizer if the optimizer was created by json config?
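A sketch of one way to reach them, assuming the optimizer handle returned by `deepspeed.initialize` exposes the usual torch `param_groups` (some ZeRO/fp16 wrappers may require going through the underlying torch optimizer instead):

```python
import deepspeed

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                    # assumed: your torch.nn.Module
    model_parameters=model.parameters(),
    config_params=ds_config,        # assumed: json config that defines the optimizer
)

# even when the optimizer is built from the json config, initialize() hands back
# a handle to it, and the engine also keeps it as engine.optimizer
for group in optimizer.param_groups:
    print(group["lr"])
```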
What about now? @tjruwase :D
@FarukOzderim, thanks for bringing attention to this again. I must confess that it has fallen off my radar. Can you please explain your interest in this issue and your desired outcome? Thanks
I have no desire or interest, I just saw that it was still hanging after 1 year and thought it could be beneficial to mention it :)
Fair enough :). Unfortunately, we are still chugging along with the counter-intuitive API for now. If you have bandwidth to contribute a PR on this, it will be greatly appreciated.
From this documentation https://deepspeed.readthedocs.io/en/latest/initialize.html#training-initialization, it looks like it is possible to pass an optimiser separately when initializing for training:
```python
model, optimizer, _, _ = deepspeed.initialize(model=model,
                                              config_params=utils.get_deepspeed_config(),
                                              optimizer=checkpoint_optimiser,
                                              model_parameters=parameters)
```
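For instance, a sketch of restoring a client optimizer's state and handing it over (`optimizer.pt` and `ds_config` are made-up names here; presumably the json config should not then also define an optimizer):

```python
import torch
import deepspeed

# sketch: build the client optimizer, restore its state from an earlier save,
# then pass it to deepspeed.initialize instead of defining one in the json config
checkpoint_optimiser = torch.optim.AdamW(model.parameters(), lr=1e-4)
checkpoint_optimiser.load_state_dict(torch.load("optimizer.pt"))

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=checkpoint_optimiser,
    config_params=ds_config,
)
```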
I have a few questions about model checkpointing: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html
I'm trying to figure out how to best integrate deepspeed into that area.
Currently our trainer saves the model, optimizer and scheduler states itself, where `self.model.state_dict` is the "client" model. And then the same for loading.

Now when I call `deepspeed.DeepSpeedEngine.save_checkpoint` I get 4 things saved: engine/model/optim/scheduler. When it comes to loading it back, I do `deepspeed.DeepSpeedEngine.load_checkpoint` - do I need to somehow update our trainer's `self.scheduler` and `self.optimizer` from that loaded object? I don't see an API to do that.

Or would it be simpler to not delegate to DS any saving other than its own engine, and save model/optim/scheduler and restore those separately (since we are doing it anyway if the trainer is not running under DeepSpeed)?
To exemplify with code: we start with our existing save/load of the separate components; the new saving code would additionally let the engine save its own state; and then on load we'd leave most of our code intact and just update the engine - roughly as sketched below.
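A sketch only, not the actual trainer code - it reuses the names from the snippet near the top of the thread and assumes a transformers-style `save_pretrained`/`from_pretrained` pair:

```python
import os
import torch

# saving: keep our per-component files and additionally let the engine save its own state
torch.save(self.optimizer.state_dict(), os.path.join(model_path, "optimizer.pt"))
torch.save(self.lr_scheduler.state_dict(), os.path.join(model_path, "scheduler.pt"))
self.model.save_pretrained(model_path)
if self.deepspeed:
    self.deepspeed.save_checkpoint(model_path)

# loading: leave the existing per-component loading intact and just refresh the engine
self.optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")))
self.lr_scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))
self.model = self.model.from_pretrained(model_path)
if self.deepspeed:
    self.deepspeed.load_checkpoint(model_path,
                                   load_optimizer_states=False,
                                   load_lr_scheduler_states=False)
```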
Am I wasting resources saving/loading the separate components, since deepspeed will have to do it anyway? I'm asking since our code is spread around and we don't always load all components together. E.g. sched/optim are loaded separately, so we end up loading the model twice because deepspeed doesn't separate the components - i.e. we can't say not to load the model (but we can skip loading the sched/optim).
Alternatively, I could just do the single engine-level call sketched below - and if this is done, do we get all the previous variables, e.g. the `self.optimizer` that we assigned at the beginning from `deepspeed.initialize`, updated to the loaded-from-the-checkpoint values, or do we now somehow have to recreate all those variables? I hope my question is easy to understand.
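Presumably just the engine call, mirroring the earlier snippet (a sketch):

```python
# rely entirely on the engine to restore model, optimizer and scheduler state
self.deepspeed.load_checkpoint(model_path,
                               load_optimizer_states=True,
                               load_lr_scheduler_states=True)
```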
If I were to ask it in a different way: what happens on `deepspeed.load_checkpoint`, where do things go, and what needs to be done besides loading the checkpoint? An example would have been very helpful.

A related question: our trainer discovers previously saved checkpoint files on the filesystem and loads them before training. How would you approach that for deepspeed - which filesystem pattern should we match to identify that there is a saved DeepSpeed checkpoint that can be loaded?
I typically see a `global_step0` folder. Is it always the same, or perhaps you have a discovery function, so that we could do something like the check sketched below? I suppose we could `try/except` too, but that's not very clean if there is (or could be) an API to do that.
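A heuristic sketch only (`checkpoint_dir` is a made-up name, and this relies on the folder/file layout that `save_checkpoint` appears to produce rather than on an official discovery API):

```python
import glob
import os

def ds_checkpoint_exists(checkpoint_dir):
    # save_checkpoint writes a per-tag folder (e.g. global_step0) and, by default,
    # a small "latest" file recording the most recent tag; treat either as evidence
    # that load_checkpoint has something to resume from
    if os.path.isfile(os.path.join(checkpoint_dir, "latest")):
        return True
    return bool(glob.glob(os.path.join(checkpoint_dir, "global_step*")))
```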
And thinking more about it: since `deepspeed.load_checkpoint` will return `(None, ?)` if nothing is found at `path` - will this invalidate the existing deepspeed object?

Thank you.