I made two changes that should help future users implement µP correctly:
Previously, a user could accidentally use mup.Adam, mup.AdamW, or mup.SGD (which are just the vanilla PyTorch optimizers re-exported) instead of the correct mup.MuAdam, mup.MuAdamW, or mup.MuSGD. The vanilla PyTorch optimizers can no longer be accessed through the mup package.
If mup.MuAdam is used with weight decay, a warning now prompts the user to switch to mup.MuAdamW for correct weight-decay scaling, as described in appendix B.3 of the ArXiv version of the paper. Note that a coord check will not reveal the incorrect implementation when MuAdam is used with weight decay, but in my experience increasing model width will still eventually degrade performance unless MuAdamW is used instead.
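As a rough illustration of why coupled weight decay interacts badly with Adam's adaptive scaling (and hence with µP's width-dependent learning rates), here is a minimal single-parameter sketch. It shows only the generic Adam-vs-AdamW distinction, not mup's actual implementation, and the function name is hypothetical:

```python
import math

def adamw_like_step(theta, grad, lr, wd, decoupled,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam/AdamW-style step on a scalar parameter, starting from
    fresh moment estimates (bias correction omitted for brevity).
    Illustrative only -- not mup's internals."""
    if not decoupled:
        # Adam-style (coupled) decay: the L2 term is folded into the
        # gradient, so it passes through the sqrt(v) normalization below.
        grad = grad + wd * theta
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    theta = theta - lr * m / (math.sqrt(v) + eps)
    if decoupled:
        # AdamW-style (decoupled) decay: an exact multiplicative shrink
        # whose strength is controlled directly by lr * wd.
        theta = theta - lr * wd * theta
    return theta

# With zero gradient, coupled decay moves theta by nearly the same amount
# no matter how small wd is (Adam's normalization erases the decay
# strength), while decoupled decay shrinks theta in proportion to wd:
for wd in (0.1, 0.001):
    coupled = adamw_like_step(2.0, 0.0, lr=0.01, wd=wd, decoupled=False)
    decoupled = adamw_like_step(2.0, 0.0, lr=0.01, wd=wd, decoupled=True)
    print(wd, round(2.0 - coupled, 6), round(2.0 - decoupled, 6))
```

Because the coupled decay term is rescaled by Adam's adaptive denominator, its effective strength gets entangled with the learning rate, which under µP varies with width; decoupling the decay (AdamW) keeps it under direct control, which is the behavior appendix B.3 relies on.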