microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Some questions about the implementation of muP. #57

Open lepodl opened 1 year ago

lepodl commented 1 year ago

I have some questions about how muP is implemented, specifically the parameter rescaling used for hyperparameter transfer:

  1. https://github.com/microsoft/mup/blob/a33ea802bcef1d7744057e34ff00d1a5d7e3d7c4/mup/layer.py#L83. According to Table 8 of your paper (TP V), the bias should scale as O(1/fan_in), but I notice that you multiply by a width_mult here. Is that correct, or did I miss something?
  2. Similarly, in https://github.com/microsoft/mup/blob/a33ea802bcef1d7744057e34ff00d1a5d7e3d7c4/mup/layer.py#L51, I think the readout weight should not be rescaled, because it is O(1) according to Table 8 of TP V.
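
To make the comparison concrete, here is a minimal sketch of the arithmetic behind both questions: multiplying a zero-mean parameter by a factor c multiplies its variance by c². The widths, the helper `variance_after_rescale`, and the two candidate factors are hypothetical and chosen only for illustration; none of this is copied from `mup/layer.py`, and it does not claim to reproduce what the library actually does.

```python
import math


def variance_after_rescale(init_var: float, factor: float) -> float:
    """Variance of a zero-mean parameter after multiplying it by `factor`."""
    return init_var * factor ** 2


# Hypothetical widths, chosen only for illustration.
base_fan_in = 256
fan_in = 1024
width_mult = fan_in / base_fan_in

# An initial variance that shrinks as 1/fan_in.
init_var = 1.0 / fan_in

# Assumed candidate rescale factors; neither is claimed to be the one in layer.py.
candidates = {
    "width_mult": width_mult,
    "sqrt(width_mult)": math.sqrt(width_mult),
}

for name, factor in candidates.items():
    new_var = variance_after_rescale(init_var, factor)
    print(f"rescale by {name}: variance {init_var:.6f} -> {new_var:.6f}")
```

Comparing the resulting variance with the target in Table 8 for a given rescale factor is the check I am trying to do by hand for the two lines linked above.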

I would greatly appreciate it if you could take the time to answer my questions!