microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Coord check looks good, but μTransfer is not working as expected #22

Closed · shjwudp closed this issue 1 year ago

shjwudp commented 1 year ago

Hello, μP team! Very excited to see you open-source your excellent work! I wanted to apply μP to our work, so on Megatron-DeepSpeed I modified the training script as suggested in the tutorial: I set the infshapes, reset the parameter initialization, switched to MuAdam, and got a coord check that looked successful. But when we transferred the learning rate that performed well on the 350M GPT model to the larger 1.3B model, we found that the 1.3B model could not withstand such a large learning rate and eventually produced NaNs.

I am wondering which details I might have overlooked, or which conditions might not be met, that cause μTransfer to fail. How should I debug this? Or does μTransfer simply not work under these conditions?

The following is the experimental information.

[images: experimental setup]

350M -> 1.3B GPT model μTransfer training loss (TensorBoard link): [image]

I think it may be a bit redundant, but in case you are interested, the μP changes are listed here (a condensed sketch of these steps follows the list):

  1. Replace output layer with MuReadout, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/model/gpt_model.py#L250
  2. Make sure to use 1/d instead of 1/sqrt(d) attention scaling, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/model/transformer.py#L175
  3. Set infshape and do μP parameter initialization, https://github.com/shjwudp/Megatron-LM/blob/mup/pretrain_gpt.py#L110
  4. Switch the optimizer to MuAdam, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/optimizer/__init__.py#L65
  5. Implement the equivalent MuReadout._rescale_parameters operation, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/mpu/layers.py#L191
  6. Modify lr scheduler to update lr according to width, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/learning_rates.py#L127
  7. Coord check, https://github.com/shjwudp/Megatron-LM/blob/mup/megatron/mup_utils.py#L16
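
For reference, a condensed sketch of these steps using the mup package on a toy model. ToyGPT, its widths, the vocab size, and the uniform init bounds are illustrative placeholders, not the actual Megatron-LM code:

```python
import torch
from mup import MuReadout, MuAdam, set_base_shapes
from mup.init import uniform_ as mup_uniform_

class ToyGPT(torch.nn.Module):
    def __init__(self, d_model, vocab_size=50257):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        self.hidden = torch.nn.Linear(d_model, d_model)  # stand-in for the transformer blocks
        # Step 1: the output layer is MuReadout instead of nn.Linear.
        self.readout = MuReadout(d_model, vocab_size)

    def forward(self, x):
        h = torch.relu(self.hidden(self.embed(x)))
        # Step 2 (inside attention, not shown here): scale scores by 1/d_head, not 1/sqrt(d_head).
        return self.readout(h)

# Step 3: set infshapes against a base and a delta model, then re-initialize any
# manually initialized parameters with the width-aware functions from mup.init.
base, delta, model = ToyGPT(d_model=256), ToyGPT(d_model=512), ToyGPT(d_model=1024)
set_base_shapes(model, base, delta=delta)  # also rescales MuReadout, the operation step 5 reimplements
mup_uniform_(model.hidden.weight, -0.02, 0.02)

# Step 4: MuAdam applies the per-width learning-rate scaling that step 6 mirrors
# in Megatron's external LR scheduler.
optimizer = MuAdam(model.parameters(), lr=1e-2)
```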
edwardjhu commented 1 year ago

Hi shjwudp,

Thanks for your interest in our work!

Your coordinate check plots seem identical across time steps, which is a sign that the learning rate is too small for the function to change. Can you try rerunning with a larger learning rate? It's possible that with a moderately larger learning rate, the muP run might blow up after a couple steps, in which case we can look into it further.
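
In case it helps, here is a minimal sketch of such a sweep using the helpers in mup.coord_check, assuming a hypothetical model constructor make_model, a base-shapes file saved earlier with make_base_shapes, and an existing train_loader:

```python
import numpy as np
from mup import set_base_shapes
from mup.coord_check import get_coord_data, plot_coord_data

def model_factory(width):
    # get_coord_data expects a dict of width -> thunk that builds a fresh model.
    def f():
        model = make_model(d_model=width)          # hypothetical constructor
        set_base_shapes(model, 'base_shapes.bsh')  # file saved earlier via make_base_shapes
        return model
    return f

widths = 2 ** np.arange(7, 12)  # 128 ... 2048
models = {int(w): model_factory(int(w)) for w in widths}

# Rerun the coord check at progressively larger learning rates; under μP the
# per-layer activation scales should stay flat in width while visibly changing
# across the first few steps.
for lr in (1e-3, 1e-2, 1e-1):
    df = get_coord_data(models, train_loader, mup=True, lr=lr,
                        optimizer='adam', nsteps=5, nseeds=3)
    plot_coord_data(df, save_to=f'coord_check_lr{lr}.png',
                    suptitle=f'muP coord check, lr={lr}')
```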

shjwudp commented 1 year ago

Hi Edward, thank you very much; your advice saved me. A larger learning rate exposed the problem: the plots showed jitter. I debugged and fixed the issue, and now the plots are smooth and show a healthy increase in value. I'm going to try μTransfer with the large learning rate on this.

Although it looks good now, the working principle is still hard for me to grasp; μP is really amazing.

shjwudp commented 1 year ago

Hi @edwardjhu, I've recently done some experiments extending the previous discussion. I found that transferring the same hyperparameters from the 350M model to the 1.3B scale works fine, but transferring to a larger 2.7B model blows up. Does that mean my hyperparameters are too aggressive? How should I avoid this?

My coord check: [image]. Comparison of the 1.3B and 2.7B models with the same hyperparameters: https://tensorboard.dev/experiment/RirdggEZS8O2rRU9clEy0g/#scalars

shjwudp commented 1 year ago

Another question: the transformer example and mutransformers use different initialization methods, (init_std / d_model) ** 0.5 vs. init_std * width_mult ** -0.5. Are these two formulas equivalent in some sense? Are there pros and cons?

edwardjhu commented 1 year ago

Thanks for your patience, Jianbin.

There are many considerations when training a very large model. In some sense, mup is a necessary but not sufficient condition for successfully training large models. Other factors include the use of weight decay and floating-point precision. Hope this helps with your investigation!

> Another question: the transformer example and mutransformers use different initialization methods, (init_std / d_model) ** 0.5 vs. init_std * width_mult ** -0.5. Are these two formulas equivalent in some sense? Are there pros and cons?

They are equivalent up to a constant.
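
To spell this out (assuming width_mult = d_model / d_base for a fixed base width d_base, and writing σ₀, σ₀′ for the two init_std constants), both rules scale the initialization standard deviation as 1/√d_model and differ only by a width-independent factor:

```latex
\sigma_{\text{mutransformers}} = \sigma_0\,\mathrm{width\_mult}^{-1/2}
  = \bigl(\sigma_0\sqrt{d_{\text{base}}}\,\bigr)\cdot d_{\text{model}}^{-1/2},
\qquad
\sigma_{\text{example}} = \bigl(\sigma_0'/d_{\text{model}}\bigr)^{1/2}
  = \sqrt{\sigma_0'}\cdot d_{\text{model}}^{-1/2}.
```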

leenachennuru commented 9 months ago

> Hi Edward, thank you very much; your advice saved me. A larger learning rate exposed the problem: the plots showed jitter. I debugged and fixed the issue, and now the plots are smooth and show a healthy increase in value. I'm going to try μTransfer with the large learning rate on this.
>
> Although it looks good now, the working principle is still hard for me to grasp; μP is really amazing.

Hi Jianbin, could you share info on what caused the jitter in your coord check plots? It's possible that I have a similar issue (#58).

Thanks!