
Feature requests for optimizer overlapping #76856

Open rohan-varma opened 2 years ago

rohan-varma commented 2 years ago

🚀 The feature, motivation and pitch

After chatting with users who are interested in using optimizer overlap with DDP (and eventually FSDP), we have a couple of feature requests (a usage sketch follows the list):

  1. Support a set_grads_to_None argument: users would like to simply pass this flag to make their calls to optimizer.step() functionally a no-op, instead of having to either set their gradients to None themselves or remove their optimizer step.
  2. Add a getter that exposes the fully initialized fused optimizer on the DDP / FSDP module. This is needed for learning rate schedulers that take the optimizer as a constructor argument, for wrappers such as torchrec's KeyedOptimizer, which wraps a vanilla torch.optim.Optimizer, and for usability features such as optimizer checkpointing.
  3. Compatibility with LR schedulers.
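
A minimal sketch of how the requested surface might be used. The registration call reflects DDP's private `_register_fused_optim` hook for running the optimizer during the backward pass (its name and signature may differ across versions); the `set_grads_to_None` flag and the `get_fused_optimizer()` getter are the proposed additions from this issue and are shown commented out under hypothetical names.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group so the sketch can run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 10))

# Overlap the optimizer with backward via DDP's private hook (assumed API).
model._register_fused_optim(torch.optim.SGD, lr=0.01)

# Request 1 (proposed): have DDP clear gradients after the fused step so a
# later external optimizer.step() is functionally a no-op.
# model._register_fused_optim(torch.optim.SGD, lr=0.01, set_grads_to_None=True)

# Request 2 (proposed, hypothetical name): expose the fully initialized fused
# optimizer so it can be passed to anything expecting a torch.optim.Optimizer,
# e.g. torchrec's KeyedOptimizer or optimizer checkpointing.
# fused_opt = model.get_fused_optimizer()

# Request 3 (proposed): LR scheduler compatibility through that getter.
# scheduler = torch.optim.lr_scheduler.StepLR(fused_opt, step_size=10)

loss = model(torch.randn(4, 10)).sum()
loss.backward()  # the registered optimizer steps as gradient buckets become ready

dist.destroy_process_group()
```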

Alternatives

No response

Additional context

No response

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

YLGH commented 2 years ago

Plan looks good to me, thanks rohan!