microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

PyTorch Lightning example #2

Open tchaton opened 2 years ago

tchaton commented 2 years ago

Dear team behind mup,

This is some great work! I believe providing a PyTorch Lightning example could help users adopt this library.

I even wonder if this technique could be embedded with even less boilerplate. I was thinking about an extension to the PyTorch Lightning Tuner that would automatically apply µP and tune the µTransferable hyperparameters.

I wonder whether someone from the mup team would be interested in investigating these ideas to further democratize this work.

Best, T.C

edwardjhu commented 2 years ago

Hi tchaton,

Thanks for the pointer to the Lightning Tuner. We are not familiar with its usage, but from the page you linked, it looks like one can pass a model to, for example, lr_find along with a grid, and the Tuner performs the necessary for loop(s) and returns the best HPs. In other words, one should be able to pass the proxy model, parametrized in muP, to the Tuner and take advantage of both right away.

Perhaps you are thinking about adding an option such as lr_find(model, mup=True, ...) to the Tuner API. The main obstacle is that we still need to let muP know which dimensions go to infinity in the limit, by instantiating models of different widths. We also need the user to manually switch to muP's optimizer variants. Both steps are hard to hide inside a single Tuner call.
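To make the first obstacle concrete, here is a plain-Python sketch (hypothetical helper, not mup's API) of why two instantiations are needed: only by comparing the parameter shapes of a base-width model against a differently-scaled one can a library tell which dimensions grow with width ("infinite") and which stay fixed.

```python
def infer_infinite_dims(base_shapes, scaled_shapes):
    """Mark which dims of each parameter change with width (sketch).

    Hypothetical illustration of the bookkeeping a muP library must do:
    a dim that differs between the two instantiations scales with width
    ("infinite"); a dim that matches is finite. Shapes are plain tuples.
    """
    infinite = {}
    for name, base in base_shapes.items():
        scaled = scaled_shapes[name]
        infinite[name] = tuple(b != s for b, s in zip(base, scaled))
    return infinite

# Parameter shapes from a width-64 base model and a width-128 variant.
base = {"embed.weight": (1000, 64), "fc.weight": (64, 64), "fc.bias": (64,)}
delta = {"embed.weight": (1000, 128), "fc.weight": (128, 128), "fc.bias": (128,)}
print(infer_infinite_dims(base, delta))
# → {'embed.weight': (False, True), 'fc.weight': (True, True), 'fc.bias': (True,)}
```

A Tuner flag would have to construct that second, rescaled model behind the user's back, which is why this is awkward to hide in a single function call.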

Please let us know if you have ideas on how we can make this integration more seamless!