microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Moving a trainable model with an optimiser between GPU and CPU #5620

Open kfertakis opened 3 weeks ago

kfertakis commented 3 weeks ago

Is your feature request related to a problem? Please describe. When a DeepSpeed model is initialised with an optimiser, the `torch.nn.Module.to()` functionality for moving the model between devices breaks: the optimiser holds references to the model parameters (and its own state tensors), so GPU memory is not freed when, for example, trying to move the model to the CPU.

Describe the solution you'd like Functionality similar to `torch.nn.Module.to()` that moves both the model and the optimiser between devices and de-allocates the previously occupied memory.

Describe alternatives you've considered The alternative is to destroy the model instance and recreate it from a checkpoint, but this has a much higher time cost.
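To illustrate the underlying behaviour in plain PyTorch (not DeepSpeed-specific): `Module.to()` moves the parameters, but the optimiser's state tensors (e.g. momentum buffers) are separate and stay on the original device, keeping that memory alive. A minimal sketch of the kind of helper being requested might look like this; `move_model_and_optimizer` is a hypothetical name, not an existing DeepSpeed or PyTorch API:

```python
import torch


def move_model_and_optimizer(model, optimizer, device):
    """Hypothetical helper: move a model *and* its optimiser state to
    `device`, so memory on the old device can actually be reclaimed."""
    model.to(device)
    # Optimiser state (momentum buffers, Adam moments, ...) holds its own
    # tensors; Module.to() does not touch these.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)
    if torch.cuda.is_available():
        # Release cached allocator blocks back to the GPU driver.
        torch.cuda.empty_cache()


# Example: SGD with momentum creates per-parameter state after a step.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss = model(torch.randn(2, 4)).sum()
loss.backward()
optimizer.step()

move_model_and_optimizer(model, optimizer, torch.device("cpu"))
```

This avoids the checkpoint round-trip, but it is only a sketch: a real DeepSpeed solution would also need to handle partitioned (ZeRO) optimiser state and gradient buffers.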