microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Optimizers for coord check #16

Closed by xwjabc 2 years ago

xwjabc commented 2 years ago

Thank you for your great work! When trying the coord check in the examples, I noticed that the original optimizers (e.g., sgd, adam) are specified instead of the muP optimizers (e.g., musgd, muadam). However, according to Table 8 in the paper, the optimizers should be adjusted accordingly to keep activations bounded. Is there a reason the original optimizers are used?

edwardjhu commented 2 years ago

Hi Weijian,

The mup coordinate check curves do use mu-optimizers. The conversion happens internally: https://github.com/microsoft/mup/blob/eac6f1dd715ccc84d571d713b9525ab0c0fcfda3/mup/coord_check.py#L441
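To illustrate the idea of an internal conversion, here is a minimal sketch of how an optimizer name such as `sgd` could be mapped to a mu-optimizer when a µP flag is set. This is not the code at the linked line: the classes are empty stand-ins so the example runs without `torch` or `mup` installed, and `resolve_optimizer_class` and `use_mup` are hypothetical names chosen for this example.

```python
# Empty stand-in classes (assumptions for illustration only); a real
# implementation would use the actual standard and mu-optimizer classes.
class SGD:
    """Stand-in for a standard SGD optimizer."""


class Adam:
    """Stand-in for a standard Adam optimizer."""


class MuSGD(SGD):
    """Stand-in for the muP variant of SGD."""


class MuAdam(Adam):
    """Stand-in for the muP variant of Adam."""


STANDARD_OPTIMIZERS = {"sgd": SGD, "adam": Adam}
MU_OPTIMIZERS = {"sgd": MuSGD, "adam": MuAdam}


def resolve_optimizer_class(name, use_mup=True):
    """Return the optimizer class for `name`.

    The caller passes the same plain name ("sgd", "adam") either way;
    the mu-variant is swapped in only when `use_mup` is True.
    """
    table = MU_OPTIMIZERS if use_mup else STANDARD_OPTIMIZERS
    if name not in table:
        raise ValueError(f"unknown optimizer name: {name!r}")
    return table[name]
```

With this pattern, `resolve_optimizer_class("sgd")` returns `MuSGD` while `resolve_optimizer_class("sgd", use_mup=False)` returns `SGD`, which is why an example script can name a plain optimizer and still be checked with a mu-optimizer.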

Does this address your concern?

xwjabc commented 2 years ago

Thank you for your quick reply! Yes, it addresses my concern. I found the automatic conversion myself right before seeing your reply :laughing: