microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

reclaiming memory for inference #897

Open stas00 opened 3 years ago

stas00 commented 3 years ago

While https://github.com/microsoft/DeepSpeed/pull/896 solves the leak problem, ideally we should also have a new method to free all optimizer/scheduler-related parts to pave the way for inference. In some environments, like Google Colab, general RAM is very scarce, so every bit counts.

Here is one way to approach this:

engine, optimizer, _, scheduler = deepspeed.initialize(...)
# ... do the training ...
# then, before inference, do:
engine.free_optimizer_and_scheduler()
optimizer = None
scheduler = None
# It is then the user's responsibility to make sure they hold no remaining
# references to the optimizer/scheduler objects, so that they can actually be freed.

with a new deepspeed method:

def free_optimizer_and_scheduler(self):
    # break the references between the wrappers first, then drop the engine's
    # own references so Python can garbage-collect the objects
    self.lr_scheduler.optimizer = None
    self.optimizer.optimizer = None
    self.lr_scheduler = None
    self.optimizer = None

That way, after training is done, the lion's share of the general RAM used by DeepSpeed is reclaimed. There are probably other bits to clean up manually to reclaim even more.
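
For illustration, here is a rough usage sketch combining the proposed method with Python's garbage collector to actually release the memory. It assumes the engine/optimizer/scheduler variables from the snippet above; free_optimizer_and_scheduler is the method proposed in this issue, not an existing DeepSpeed API, and how much empty_cache() helps depends on how much of the freed state lives on GPU.

import gc
import torch

# training is done; drop the training-only state (proposed API, see above)
engine.free_optimizer_and_scheduler()
optimizer = None
scheduler = None

gc.collect()                  # collect the now-unreferenced optimizer/scheduler objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return any cached GPU blocks to the driver

# the engine itself remains usable for forward passes / generation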

Let me know if this sounds good to you and I will put together another PR with this feature. We can extend it in the future, if need be, to cover other things that benefit inference.

Thank you.

@jeffra, @RezaYazdaniAminabadi

RezaYazdaniAminabadi commented 3 years ago

I think the things you mentioned make sense. We also need to make sure that freeing (or not allocating) the memory for those training-related parts happens automatically when we are in inference mode. I mean that the user shouldn't need to specifically call a function like free_optimizer_and_scheduler to free that memory, but should instead have an easy way of switching modes, like eval() mode in PyTorch.
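
As a minimal sketch of the interface this suggests: DeepSpeedEngine is an nn.Module, so it already has train()/eval(), but today those only flip the usual training flag. The class below is illustrative only, not DeepSpeed's real implementation, and the way it rebuilds the optimizer/scheduler from a stored config is an assumption about how such a switch could work.

import torch

class ModeSwitchingEngine(torch.nn.Module):  # illustrative stand-in, not the real DeepSpeedEngine
    def __init__(self, module, config):
        super().__init__()
        self.module = module
        self.config = config                  # kept around so training state can be rebuilt
        self._build_training_state()

    def _build_training_state(self):
        # stand-in for DeepSpeed's real optimizer/scheduler construction
        self.optimizer = torch.optim.Adam(self.module.parameters(), lr=self.config["lr"])
        self.lr_scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=1)

    def eval(self):
        super().eval()                        # the usual nn.Module flag flip
        self.optimizer = None                 # additionally drop training-only state
        self.lr_scheduler = None
        return self

    def train(self, mode=True):
        super().train(mode)
        if mode and self.optimizer is None:
            self._build_training_state()      # re-create it from the saved config
        return self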

stas00 commented 3 years ago

I agree! That would be nice indeed.

But torch's model.eval()/train() just turns some flags on and off. How would you deal with the user switching back from eval to train in DeepSpeed? Do you save the config and simply re-init the parts that were freed for eval?

RezaYazdaniAminabadi commented 3 years ago

Yes, that can be a viable option, since we eventually have to control the checkpointing through DeepSpeed anyway if we want to switch seamlessly between these two modes. I think we can hide all the operations needed to switch between inference and training modes, so that to the user it still feels like flipping a flag on and off.
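
A rough sketch of the flow this implies, using DeepSpeed's existing save_checkpoint/load_checkpoint engine methods; the free/rebuild calls are hypothetical names for the functionality discussed in this issue, not current DeepSpeed APIs:

save_dir = "checkpoints"

# persist the training state that DeepSpeed owns before dropping it
engine.save_checkpoint(save_dir, tag="pre-inference")

# hypothetical: free optimizer/scheduler memory, then run inference
engine.free_optimizer_and_scheduler()
engine.eval()
# outputs = engine(inference_batch)

# hypothetical: switching back to training re-creates the freed objects from the
# stored config and restores their state from the checkpoint
engine.train()
engine.rebuild_optimizer_and_scheduler()
engine.load_checkpoint(save_dir, tag="pre-inference")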

stas00 commented 3 years ago

Yes, please!