
Feature requests for optimizer overlapping #76856

Open rohan-varma opened 2 years ago

rohan-varma commented 2 years ago

🚀 The feature, motivation and pitch

After chatting with users who are interested in using optimizer overlap with DDP (and eventually FSDP), we have a couple of feature requests (a usage sketch follows the list):

  1. Support a set_grads_to_None argument: users would like to simply pass this flag to make their calls to optimizer.step() functionally a no-op, instead of having to either set their gradients to None themselves or remove their optimizer step.
  2. Add a getter that exposes the fully initialized fused optimizer on the DDP / FSDP module. This is needed for learning rate schedulers that take the optimizer as a constructor argument, for wrappers such as torchrec's KeyedOptimizer, which wraps a vanilla torch.optim.Optimizer, and for usability features such as optimizer checkpointing.
  3. Compatibility with LR schedulers.
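
A minimal sketch of how the requested surface might be used. The registration call reflects DDP's private `_register_fused_optim` hook for running the optimizer during the backward pass (its name and signature may differ across versions); the `set_grads_to_None` flag and the `get_fused_optimizer()` getter are the proposed additions from this issue and are shown commented out under hypothetical names.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group so the sketch can run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 10))

# Overlap the optimizer with backward via DDP's private hook (assumed API).
model._register_fused_optim(torch.optim.SGD, lr=0.01)

# Request 1 (proposed): have DDP clear gradients after the fused step so a
# later external optimizer.step() is functionally a no-op.
# model._register_fused_optim(torch.optim.SGD, lr=0.01, set_grads_to_None=True)

# Request 2 (proposed, hypothetical name): expose the fully initialized fused
# optimizer so it can be passed to anything expecting a torch.optim.Optimizer,
# e.g. torchrec's KeyedOptimizer or optimizer checkpointing.
# fused_opt = model.get_fused_optimizer()

# Request 3 (proposed): LR scheduler compatibility through that getter.
# scheduler = torch.optim.lr_scheduler.StepLR(fused_opt, step_size=10)

loss = model(torch.randn(4, 10)).sum()
loss.backward()  # the registered optimizer steps as gradient buckets become ready

dist.destroy_process_group()
```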

Alternatives

No response

Additional context

No response

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

YLGH commented 2 years ago

Plan looks good to me, thanks rohan!