Closed by xwjabc 2 years ago
Hi Weijian,
The mup coordinate check does use the muP optimizers. The conversion from the standard optimizers happens internally: https://github.com/microsoft/mup/blob/eac6f1dd715ccc84d571d713b9525ab0c0fcfda3/mup/coord_check.py#L441
Does this address your concern?
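For readers following along, the internal conversion described in this reply can be pictured as a name-to-class lookup that depends on whether muP is enabled. The sketch below is only an illustration of that pattern, not the code at the linked line: `resolve_optimizer`, `use_mup`, and the two lookup tables are hypothetical names, and the values are plain strings so the snippet runs without `torch` or `mup` installed.

```python
# Illustrative sketch only: a hypothetical lookup, not the mup source code.
# The strings name the optimizer classes that would be chosen.
STANDARD_OPTIMIZERS = {"sgd": "torch.optim.SGD", "adam": "torch.optim.Adam"}
MUP_OPTIMIZERS = {"sgd": "mup.MuSGD", "adam": "mup.MuAdam"}


def resolve_optimizer(name: str, use_mup: bool = True) -> str:
    """Map a user-facing optimizer name to the class that would be used."""
    table = MUP_OPTIMIZERS if use_mup else STANDARD_OPTIMIZERS
    try:
        return table[name.lower()]
    except KeyError:
        raise ValueError(f"unsupported optimizer: {name!r}") from None


# Passing "sgd" still ends up at the muP variant when muP is enabled.
print(resolve_optimizer("sgd"))
print(resolve_optimizer("adam", use_mup=False))
```

Under this assumption, passing `sgd` or `adam` in the coord check examples is consistent with the muP optimizers being used, since the name is only a key and the muP setting decides which class it maps to.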
Thank you for your quick reply! Yes, it addresses my concern. I found the automatic conversion just before seeing your reply :laughing:
Thank you for your great work! When trying the coord check in the examples, I noticed that the original optimizers (e.g., sgd, adam) are used instead of the muP optimizers (e.g., musgd, muadam). However, according to Table 8 in the paper, the optimizers should be adjusted accordingly to keep the activations bounded. Is there a reason for using the original optimizers?