Closed by zanussbaum 1 year ago
Hi Zach,
Thanks for your interest in muP!
A couple things come to mind.
Happy to look into it further if doing the above still doesn't resolve the issue.
Thanks Edward for the suggestions!
Quick clarification: can you expand on what you mean by "reason if it's expected"? Are there situations where that wouldn't be the case?
And yes, they're from a single seed (IIRC). In practice, how many seeds should we take a look at? Does muP only hold when averaged across seeds?
I meant using your knowledge of the specific architecture you are using to reason if there's a bug in the code. E.g., maybe it's okay if it's the output of the first layer if by design the first layer should have a small activation norm.
Our theoretical claim of hyperparameter stability is about the mean performance over random initialization. In practice, whether that holds true depends on the variance of your metrics. Here, AUPRC seems to have a larger variance than loss. However, when transferring to a much wider model (e.g., 40M to 6.7B in our paper), the variance should be small relative to the gain provided by muP, and one seed is fine in that case.
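To make the seed question concrete, here is a minimal sketch (stdlib only, with hypothetical AUPRC numbers) of comparing two widths by their mean over seeds rather than by single runs:

```python
import statistics

def summarize_over_seeds(metric_by_seed):
    """Aggregate a metric (e.g., final AUPRC) across random seeds.

    muP's stability claim is about mean performance over random
    initializations, so compare mean +/- stdev between widths
    rather than comparing single runs.
    """
    mean = statistics.mean(metric_by_seed)
    stdev = statistics.stdev(metric_by_seed) if len(metric_by_seed) > 1 else 0.0
    return mean, stdev

# Hypothetical AUPRC values from 3 seeds at two widths
narrow = [0.61, 0.58, 0.60]
wide = [0.63, 0.64, 0.62]
m_n, s_n = summarize_over_seeds(narrow)
m_w, s_w = summarize_over_seeds(wide)
# Wider is convincingly better only if the gap exceeds the seed noise
print(m_w - m_n > max(s_n, s_w))  # → True
```

With a large width gap (e.g., 40M to 6.7B) the mean gap usually dwarfs the stdev, which is why one seed can suffice there.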
I also tried this with a transformer based model and found similar results where the transferred HPs did not result in better performance.
Hi @zanussbaum, I think the following figure may resolve your doubts: only when the width is large enough does muP show an advantage over SP. It shows that when the width is greater than or equal to 2048, muP shows its advantage.
Am I understanding this right? @edwardjhu
@shjwudp Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random seeds becomes less visible in comparison. The converse of that is when the difference in width is small, the variance among random init could dominate and wider doesn't appear to be better.
Thanks for the complete explanation!
@shjwudp I interpreted that chart as showing the benefits of muP, given that with increasing width the transferred HPs do actually provide better performance, whereas SP has a shifting optimum across width.
But it makes sense that we should expect greater variance for smaller models.
@edwardjhu thanks for this explanation. I think I missed this part in the paper, but it makes intuitive sense to me. Is there a section in the paper that describes how muP can be affected by random seeds?
@edwardjhu I did not realize the keys of the models dict were their widths, so the plots look a little different when I make that change.

muP plot: (image)
SP plot: (image)
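For context, here is a minimal sketch of how a coord-check `models` dict is keyed by width (the `make_model` constructor here is a stand-in, not the real model class):

```python
def make_model(width):
    # Stand-in constructor; real code would build an nn.Module of this width
    return {"width": width}

widths = [64, 128, 256, 512]
# The dict keys ARE the widths; the values are zero-arg constructors so
# each coord-check trial (seed) builds a freshly initialized model
models = {w: (lambda w=w: make_model(width=w)) for w in widths}
print(sorted(models))          # → [64, 128, 256, 512]
print(models[256]()["width"])  # → 256
```

Note the `w=w` default-argument trick, which pins each lambda to its own width instead of the loop variable.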
Counting the trainable params, I believe the largest model is ~15x bigger than the smallest. I'll try and see how this transfers with a larger increase, closer to the ~150x increase in your models!
> Is there a section in the paper that describes how muP can be affected by random seeds?
You can check out Definition A.3, where we state that this optimality is approximate at finite width due to randomness.
ah thanks so much! I missed that in my first few passes
@edwardjhu this is a somewhat silly question, but I wanted to double check. When we are transferring parameters, should we retain the muP Readout layers or revert to the standard layers offered in PyTorch? Above, we used the muP Readout layers but didn't get the same transfer (even with the 150x model). But now we are seeing the transfer if we use the standard PyTorch layers. Is this expected? Should they be equivalent?
Interesting! You should use the mup Readout layer on the scaled-up model.
Which plot are you referring to? Is it one of the plots you have posted already?
I don't have the plots yet as the model is currently running, but I can add them once it finishes. The plots I was referring to were the loss and AUPRC curves.
I will post the results here once the runs are finished. Since they are quite large, it may take a while. I appreciate all your help!
Out of curiosity, what's the difference between using the Readout layer vs. the regular PyTorch layer for the scaled-up model? If I'm understanding the code correctly, it seems the LR for those layers won't be scaled like in the smaller model. Do we want to keep the same initialization then?
All four curves in your loss/AUPRC plot seem to be using mup, right?
The readout layer divides the output logits by the appropriate scalar, which is a function of width. The LR should be scaled for readout layers too, which is done in our mu-optimizers. Could you elaborate on your observation? The weight initialization should also be a function of width (i.e., stddev ~ 1/\sqrt{width}).
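To illustrate the scalar described above, here is a minimal sketch of the readout scaling (simplified from what `mup.MuReadout` does; not the library's actual implementation):

```python
def readout_scale(fan_in, base_fan_in, output_mult=1.0):
    """Scalar applied to the readout logits (simplified muP-style readout).

    width_mult = fan_in / base_fan_in; the logits are multiplied by
    output_mult / width_mult, so widening the model does not blow up
    the scale of the output.
    """
    width_mult = fan_in / base_fan_in
    return output_mult / width_mult

print(readout_scale(fan_in=2048, base_fan_in=128))  # → 0.0625
print(readout_scale(fan_in=128, base_fan_in=128))   # → 1.0
```

A plain `nn.Linear` omits this 1/width_mult factor (and the matching LR scaling in `MuAdam`), which is why swapping it in changes behavior at large width.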
Yes sorry for the confusion!
Here is what I have tried so far. I am running a small model and a large model with the same parameters, where the large model is 150x larger. We are tracking the loss and AUPRC, similar to the plots shown above.
We have observed that for the large model using the Readout layers and the muP optimizer, the loss and AUPRC do not perform better than the smaller model at each step. However, if we use the regular PyTorch optimizer and layers for the large model, we see significantly better performance than the muP small model. Hopefully this clears things up! I am double checking our muP Large run to make sure that we indeed used the right base shapes to train the model and will follow up.
@edwardjhu ok, unfortunately the 150x larger model with the Readout layers seems to be performing about the same as or worse than the tiny model. I can post the loss curves in a few days once they finish, but I expect they will look similar to the above.
Here's how I created the large model; perhaps there's an issue with this?

* Using a saved base-shapes file (`shapes.bsh`)

```python
model = LargeModel()
set_base_shapes(model, "shapes.bsh")
opt = MuAdam(...)
```

* Using the models directly

```python
model = LargeModel()
delta = MediumModel()
base = TinyModel()
set_base_shapes(model, base, delta=delta)
opt = MuAdam(model.parameters())
```
However, the 150x model trained without the Readout layer again has a 3x performance boost. Any thoughts on what we can do to debug?
Issue has been fixed! Resolved by finding a better `output_mult`. Thanks for the help @edwardjhu! Closing as resolved.
Hey @zanussbaum, can you elaborate a bit on how you found a better `output_mult`? Did you use the same `output_mult` for all model sizes?
@ndey96 Yes, from what I understand, model families should share a similar `output_mult`. I just added another HP to search over, where I multiplied the output by something like `[2**x for x in range(2, 10)]`. My understanding is that `output_mult` doesn't need to be finely tuned; it just needs to be on the order of the "right" value.
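The search described above can be sketched as a coarse grid search (the `evaluate` callable is a hypothetical proxy-run scorer, not part of the mup library):

```python
def best_output_mult(evaluate, candidates=None):
    """Coarse grid search over output_mult (the power-of-two grid above).

    `evaluate` is a hypothetical callable mapping an output_mult value to a
    validation loss from a short proxy run; since output_mult only needs to
    be on the right order of magnitude, a power-of-two grid is enough.
    """
    if candidates is None:
        candidates = [2**x for x in range(2, 10)]  # 4, 8, ..., 512
    return min(candidates, key=evaluate)

# Toy stand-in for a proxy-run evaluation whose optimum is near 32
print(best_output_mult(lambda m: (m - 32) ** 2))  # → 32
```

Once the tuned value transfers, the same `output_mult` is reused across the model family.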
Hi all, attaching the coord check plots and also a screenshot of the train loss and AUPRC plots. I used the Conv1D from the branch, but have also tried
I was looking at the conv plots from the examples, and I noticed that one of the layers is constant across width but, after the first step, is significantly smaller. Is that an issue?
muP plot: ![coord_conv_mup](https://user-images.githubusercontent.com/33707069/182205732-0fe250d4-69da-4863-84e6-ed2bc72b60cd.png)
SP plot: ![coord_conv_sp](https://user-images.githubusercontent.com/33707069/182205792-162ff2e6-1c7d-4005-a952-7d6b6b5025a3.png)
Train loss: ![Screen Shot 2022-08-01 at 1 12 52 PM](https://user-images.githubusercontent.com/33707069/182206024-0eb4699a-316d-46d5-9cca-2a5bfa1e964b.png)
Train AUPRC: ![Screen Shot 2022-08-01 at 1 12 41 PM](https://user-images.githubusercontent.com/33707069/182206054-9fecdcfa-ff85-4c28-9055-33ddcb3d64f9.png)
I also tried this with a transformer based model and found similar results where the transferred HPs did not result in better performance. I can regenerate those plots if needed.
Is this expected? What can I do to fix this? Having muP work would be a huge unlock for us :D