**Is your feature request related to a problem? Please describe.**
When a DeepSpeed model is initialised with an optimiser, the `torch.nn.Module.to()` functionality for moving the model between devices breaks: the optimiser holds references to the model parameters, so GPU memory is not released when, for example, moving the model to the CPU.
**Describe the solution you'd like**
Functionality similar to `torch.nn.Module.to()` that moves both the model and the optimiser between devices and de-allocates the previously occupied memory.
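In the absence of such an API, the requested behaviour can be approximated in plain PyTorch by relocating the optimiser's state tensors by hand. The sketch below uses a standard `torch.optim.Adam` optimiser; `move_optimizer_state` is a hypothetical helper written for this illustration, not an existing DeepSpeed or PyTorch API:

```python
import torch

def move_optimizer_state(optimizer, device):
    # Hypothetical helper: relocate every tensor held in the optimiser
    # state (e.g. Adam's exp_avg / exp_avg_sq buffers) to `device`, so
    # that the memory on the old device can actually be freed.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# Demonstrated on CPU for portability; with a CUDA model, calling
# model.to("cpu") alone leaves GPU memory pinned because
# optimizer.state still references the CUDA tensors.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters())
model(torch.randn(2, 4)).sum().backward()
optimizer.step()                       # populates the optimiser state

model.to("cpu")
move_optimizer_state(optimizer, "cpu")
# On a GPU run, torch.cuda.empty_cache() would now release the cached blocks.
```

This covers a bare optimiser, but not a DeepSpeed engine, whose partitioned state the requested feature would need to handle as well.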
**Describe alternatives you've considered**
The alternative is to destroy the model instance and recreate it from a checkpoint, but this has a much higher time cost.