microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Conv1D Coord check looks good (I think), but μTransfer does not seem to work? #23

Closed zanussbaum closed 1 year ago

zanussbaum commented 1 year ago

Hi all, attaching the coord check plots and also a screenshot of the train loss and AUPRC plots. I used the Conv1D from the branch, but have also tried

I was looking at the conv plots from the examples and noticed that one of the layers is constant across width but becomes significantly smaller after the first step. Is that an issue?

[muP coord check plot: coord_conv_mup]

[SP coord check plot: coord_conv_sp]

[Train loss plot]

[Train AUPRC plot]

I also tried this with a transformer based model and found similar results where the transferred HPs did not result in better performance. I can regenerate those plots if needed.

Is this expected? What can I do to fix this? Having mup work would be a huge unlock for us :D
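For concreteness, here is a minimal sketch (not the model from this issue; `ConvNet`, its layer sizes, and `n_classes` are made up for illustration) of the kind of Conv1d family a coord check like this assumes: width is a constructor argument so the same class can be built at every width, and the output layer is mup's `MuReadout`.

```python
import torch.nn as nn
from mup import MuReadout

class ConvNet(nn.Module):
    def __init__(self, width=64, in_channels=8, n_classes=2):
        super().__init__()
        # width is the dimension being scaled for µTransfer
        self.conv1 = nn.Conv1d(in_channels, width, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(width, width, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool1d(1)
        # MuReadout replaces the final nn.Linear so the logits scale correctly with width
        self.readout = MuReadout(width, n_classes)

    def forward(self, x):                     # x: (batch, in_channels, length)
        h = self.act(self.conv1(x))
        h = self.act(self.conv2(h))
        h = self.pool(h).squeeze(-1)          # (batch, width)
        return self.readout(h)
```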

edwardjhu commented 1 year ago

Hi Zach,

Thanks for your interest in muP!

A couple of things come to mind.

Happy to look into it further if doing the above still doesn't resolve the issue.

zanussbaum commented 1 year ago

Thanks Edward for the suggestions!

Quick clarification: can you expand on what you mean by "reason if it's expected"? Are there situations where that wouldn't be the case?

And yes they're from a single seed (IIRC). In practice, how many seeds should we take a look at? Does mup only hold when averaged across seeds?

edwardjhu commented 1 year ago

I meant using your knowledge of the specific architecture you are using to reason about whether there's a bug in the code. E.g., it may be fine if the small activation is the output of the first layer and, by design, the first layer should have a small activation norm.

Our theoretical claim of hyperparameter stability is about the mean performance over random initialization. In practice, whether that holds true depends on the variance of your metrics. Here, AUPRC seems to have a larger variance than loss. However, when transferring to a much wider model (e.g., 40M to 6.7B in our paper), the variance should be small relative to the gain provided by muP, and one seed is fine in that case.
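To make the "mean performance over random initialization" point concrete, here is a hypothetical sketch of comparing configurations by averaging over several seeds; `train_and_eval`, `mup_config`, and `sp_config` are stand-ins for your own training loop and configs, not part of mup.

```python
import statistics
import torch

def mean_metric(config, seeds=(0, 1, 2, 3, 4)):
    """Train the same config under several seeds and summarize the metric."""
    scores = []
    for seed in seeds:
        torch.manual_seed(seed)
        scores.append(train_and_eval(config))   # stand-in: returns e.g. final AUPRC
    return statistics.mean(scores), statistics.stdev(scores)

# Compare muP vs SP on the means, and check the stdevs are small relative to the gap:
# mup_mean, mup_std = mean_metric(mup_config)
# sp_mean, sp_std = mean_metric(sp_config)
```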

shjwudp commented 1 year ago

> I also tried this with a transformer based model and found similar results where the transferred HPs did not result in better performance.

Hi @zanussbaum, I think the following figure may resolve your doubts: muP only shows an advantage over SP when the width is large enough. Here, the advantage appears once the width is greater than or equal to 2048.

Am I understanding this right? @edwardjhu

[figure: muP vs SP performance across widths]

edwardjhu commented 1 year ago

@shjwudp Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random seeds becomes less visible in comparison. The converse of that is when the difference in width is small, the variance among random init could dominate and wider doesn't appear to be better.

shjwudp commented 1 year ago

> Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random seeds becomes less visible in comparison. The converse of that is when the difference in width is small, the variance among random init could dominate and wider doesn't appear to be better.

Thanks for the complete explanation!

zanussbaum commented 1 year ago

@shjwudp I interpreted that chart as showing the benefits of muP: at increasing width, the transferred HPs do actually provide better performance, whereas SP has a shifting optimum across width.

But it makes sense that we should expect greater variance for smaller models.

zanussbaum commented 1 year ago

@edwardjhu thanks for this explanation. I think I missed this part in the paper, but it makes intuitive sense to me.

Is there a section in the paper that describes how mup can be affected by random seeds?

zanussbaum commented 1 year ago

@edwardjhu I did not realize the keys of the models dict were their widths, so the plots look a little different when I make that change.

[muP coord check plot: coord_conv_mup]

[SP coord check plot: coord_conv_sp]

Counting the trainable params, I believe the largest model is ~15x bigger than the smallest. I'll try to see how this transfers with a larger increase, closer to the ~150x increase in your models!
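For reference, a sketch of the coord-check call this refers to, assuming the `get_coord_data`/`plot_coord_data` helpers in `mup.coord_check` behave roughly as in the repo's examples: the keys of the `models` dict are the widths themselves, and the values are zero-argument constructors. `ConvNet`, the shapes file name, and the synthetic data are the hypothetical placeholders from the sketches above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from mup import set_base_shapes
from mup.coord_check import get_coord_data, plot_coord_data

# tiny synthetic dataloader, just to drive a few coord-check steps
x = torch.randn(32, 8, 16)                      # (batch, in_channels, length)
y = torch.randint(0, 2, (32,))
dataloader = DataLoader(TensorDataset(x, y), batch_size=8)

def lazy_model(width):
    # each value is a constructor; base shapes must already be set for the muP check
    return lambda: set_base_shapes(ConvNet(width=width), "shapes.bsh")

models = {w: lazy_model(w) for w in (64, 128, 256, 512, 1024)}  # keys = widths
df = get_coord_data(models, dataloader, optimizer="adam")       # pass mup=False for the SP plot
plot_coord_data(df, save_to="coord_conv_mup.png")
```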

edwardjhu commented 1 year ago

Is there section in the paper that describes how mup can be affected by random seeds?

You can check out Definition A.3, where we state that this optimality is approximate at finite width due to randomness.

zanussbaum commented 1 year ago

ah thanks so much! I missed that in my first few passes

zanussbaum commented 1 year ago

@edwardjhu this is a somewhat silly question, but I wanted to double-check. When we are transferring parameters, should we retain the mup Readout layers or switch to the standard SP layers from PyTorch? Above, we used the mup Readout layers but didn't get the same transfer (even across the 150x model). But now we are seeing the transfer if we use the SP layers from PyTorch. Is this expected? Should they be equivalent?

edwardjhu commented 1 year ago

Interesting! You should use the mup Readout layer on the scaled-up model.

Which plot are you referring to? Is it one of the plots you have posted already?

zanussbaum commented 1 year ago

I don't have the plots yet as the model is currently running, but I can add them once it finishes. The plots I was referring to were the loss and AUPRC curves.

I will post the results here once the runs are finished. Since they are quite large, it may take a while. I appreciate all your help!

Out of curiosity, what's the difference between using the Readout layer vs the regular PyTorch layer for the scaled-up model? If I'm understanding the code correctly, it seems the LR for those layers won't be scaled the way it is in the smaller model. Do we want to keep the same initialization then?

edwardjhu commented 1 year ago

All four curves in your loss/AUPRC plot seem to be using mup, right?

The readout layer divides the output logits by the appropriate scalar, which is a function of width. The LR should be scaled for readout layers too, which is done in our mu-optimizers. Could you elaborate on your observation? The weight initialization should also be a function of width (i.e., stddev ∝ 1/\sqrt{width}).
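To illustrate the mechanism described above, here is a schematic (not the actual mup source; `ToyReadout` and `base_in_features` are made up for illustration) of a readout whose logits are divided by the width multiplier relative to the base model.

```python
import torch.nn as nn

class ToyReadout(nn.Linear):
    """Schematic of a width-aware readout; not the real MuReadout implementation."""
    def __init__(self, in_features, out_features, base_in_features, output_mult=1.0):
        super().__init__(in_features, out_features)
        self.output_mult = output_mult
        # how much wider this model is than the base model, e.g. 2048 / 128 = 16
        self.width_mult = in_features / base_in_features

    def forward(self, x):
        # dividing by width_mult keeps the logits O(1) as the model gets wider
        return super().forward(x) * self.output_mult / self.width_mult
```

In the actual library, `MuReadout` reads the width multiplier from the shape information attached by `set_base_shapes`, and `MuAdam`/`MuSGD` apply the matching per-parameter-group LR scaling, so neither needs to be done by hand.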

zanussbaum commented 1 year ago

Yes sorry for the confusion!

Here is what I have tried so far. I am running a small model and a large model with the same parameters, where the large model is 150x larger. We are tracking the loss and AUPRC, similar to the plots shown above.

We have observed that for the large model using the Readout layers and the mup optimizer, the loss and AUPRC do not improve over the smaller model at each step. However, if we use the regular PyTorch optimizer and layers for the large model, we see significantly better performance than with the mup small model. Hopefully this clears things up! I am double-checking our run with the large mup model to make sure that we indeed used the right base shapes to train it and will follow up.

zanussbaum commented 1 year ago

@edwardjhu ok, unfortunately the 150x larger model with the Readout layers seems to be performing about the same as or worse than the tiny model. I can post the loss curves in a few days once they finish, but I expect they will look similar to the ones above.

Here's how I created the large model; perhaps there's an issue with this?

* Using a saved base shapes file:

```python
set_base_shapes(model, "shapes.bsh")
opt = MuAdam(...)
```

* Using the models directly:

```python
model = LargeModel()
delta = MediumModel()
base = TinyModel()
set_base_shapes(model, base, delta=delta)
opt = MuAdam(model.parameters())
```

However, the 150x model trained without using the Readout layer again has a 3x performance boost. Thoughts on what we can do to debug?
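For comparison, here is how the two snippets above are usually wired together in the README-style workflow (a sketch; the model class names are the placeholders from the snippets, and the file name is just the one used above): the .bsh file is produced once from a base model and a delta model, then applied to the scaled-up model.

```python
from mup import make_base_shapes, set_base_shapes, MuAdam

base = TinyModel()      # base width, e.g. 128
delta = MediumModel()   # same architecture at a different width, e.g. 256
make_base_shapes(base, delta, savefile="shapes.bsh")   # run once, offline

model = LargeModel()    # the scaled-up model, still using MuReadout internally
set_base_shapes(model, "shapes.bsh")                   # or: set_base_shapes(model, base, delta=delta)
opt = MuAdam(model.parameters(), lr=1e-3)              # muP-aware optimizer scales LRs per layer
```

Either way, the base and delta models should differ only in width; everything else should match the scaled-up model.
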
zanussbaum commented 1 year ago

Issue has been fixed! Resolved by finding a better output_mult. Thanks for the help @edwardjhu! Closing as resolved

ndey96 commented 1 year ago

Hey @zanussbaum can you elaborate a bit on how you found a better output_mult? Did you use the same output_mult for all model sizes?

zanussbaum commented 1 year ago

@ndey96 Yes, from what I understand, model families should share a similar output_mult. I just added another HP to search over, where I multiplied the output by something like [2**x for x in range(2, 10)]. My understanding is that the output_mult doesn't need to be finely tuned; it just needs to be on the order of the "right" value.
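For anyone finding this later, a hedged sketch of the kind of coarse sweep described above, assuming the multiplier is supplied via `MuReadout`'s `output_mult` argument; `ConvNet`, `train_and_eval`, and the shapes file reuse the hypothetical names from the earlier sketches.

```python
from mup import MuReadout, set_base_shapes

results = {}
for output_mult in [2 ** x for x in range(2, 10)]:      # 4, 8, ..., 512
    model = ConvNet(width=256)                          # small proxy model
    # swap in a readout with the candidate multiplier, then re-apply base shapes
    model.readout = MuReadout(256, 2, output_mult=output_mult)
    set_base_shapes(model, "shapes.bsh")
    results[output_mult] = train_and_eval(model)        # stand-in for a short training run
best_mult = max(results, key=results.get)               # only the order of magnitude matters
```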