microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License
1.24k stars 88 forks source link

Refactor: Addressing Sources of User Error #73

Open thomasfortin1 opened 2 months ago

thomasfortin1 commented 2 months ago

I made two changes which should help future users implement MuP correctly:

Previously a user could use mup.Adam, mup.AdamW, or mup.SGD (which are just the regular PyTorch optimizers) instead of the correct mup.MuAdam, mup.MuAdamW, or mup.MuSGD. Now the vanilla PyTorch optimizers cannot be accidentally accessed through the mup package.

If mup.MuAdam is used with weight decay, a warning will prompt the user to switch to mup.MuAdamW for correct weight decay scaling as described in appendix B.3 of the version of the paper which is on ArXiv. Note that doing a coord check will not indicate an incorrect implementation when using MuAdam with weight decay, but increasing model size will still eventually lead to diminishing performance unless MuAdamW is used instead (in my experience).

thomasfortin1 commented 2 months ago

@microsoft-github-policy-service agree