FlamingHorizon closed this issue 1 year ago.
Hi!
This sounds like a precision issue or a bug in your code. You shouldn't have to adjust the warmup ratio if you are training for the same number of steps. Are the batches ordered in the same way in these runs?
It doesn't matter, since you can still model an arbitrary probability. Zero init of the output layer can be helpful, since it removes the effect of the initial GP (the Gaussian-process-like behavior of the randomly initialized network).
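To make the zero-init point concrete, here is a toy sketch in plain Python (it does not use the mup API; `readout` and the feature vectors are hypothetical stand-ins for a network's last hidden layer): with a zero-initialized readout, the initial logit is exactly 0 for every input and every width, so predictions start at probability 0.5 with no contribution from the random initial GP, while a single logit can still represent any probability in (0, 1).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def readout(features, weights, bias=0.0):
    """Linear readout producing a single logit for binary classification."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Hypothetical hidden features from two networks of different widths.
features_small = [0.3, -1.2, 0.7, 0.1]
features_large = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, -0.2, 0.4]

# Zero-initialized readout weights: the initial logit is 0 at any width,
# so the starting probability is sigmoid(0) = 0.5 for every input.
p_small = sigmoid(readout(features_small, [0.0] * len(features_small)))
p_large = sigmoid(readout(features_large, [0.0] * len(features_large)))
assert p_small == p_large == 0.5

# A single logit can still represent an arbitrary probability p via
# logit = log(p / (1 - p)); training just has to move the weights there.
target_p = 0.73
logit = math.log(target_p / (1 - target_p))
assert abs(sigmoid(logit) - target_p) < 1e-9
```

This is also why zero readout init aids transfer: the model's initial predictions are identical across widths, so width only enters through training dynamics.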
Hope this helps!
Thanks for your reply!
Yes, the batches are ordered in the same way, and the models are trained in full float32. A "silly" question: did you mean that if a model converges at small width, it is guaranteed to converge at larger width if mu-transferred correctly?
Yes, the larger model should never do worse than the small model in muP, especially if the entire training procedure (number of steps, etc.) is fixed.
I see, thanks. Here are some results I got yesterday for training loss @ 10k steps. I was tuning lr + output_mult + initializer_range. It seems that "the wider, the better" holds if the model does not diverge.
I'll double-check my code, but please also let me know if anything looks weird in this table!
[image: image] https://user-images.githubusercontent.com/16420121/225200186-1cb46452-4edd-4be4-898b-960c932ab502.png
Are you using low precision?
No, the model was trained in float32.
Is this a pre layernorm transformer?
Yes, it was exactly a GPT-2 with large depth, and batch_size = 24 (too small?).
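For readers unfamiliar with the distinction being asked about, a schematic in plain Python (`norm` and `sublayer` are toy stand-ins for LayerNorm and an attention/MLP sublayer, not real model code): pre-LN normalizes inside the residual branch so the skip path stays clean, while post-LN normalizes after the residual add, which is part of why deep post-LN transformers are harder to train; pre-LN is the variant used in GPT-2-style models.

```python
def norm(xs):
    """Toy stand-in for LayerNorm: zero-mean, roughly unit scale."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    scale = (sum(c * c for c in centered) / len(xs)) ** 0.5 or 1.0
    return [c / scale for c in centered]

def sublayer(xs):
    """Toy stand-in for an attention or MLP sublayer."""
    return [2.0 * x for x in xs]

def pre_ln_block(xs):
    # Pre-LN: normalize *inside* the residual branch; the skip path is an
    # identity, so signals propagate unchanged through depth.
    return [x + y for x, y in zip(xs, sublayer(norm(xs)))]

def post_ln_block(xs):
    # Post-LN: normalize *after* the residual add; the skip path itself
    # passes through LayerNorm at every block.
    return norm([x + y for x, y in zip(xs, sublayer(xs))])

x = [1.0, -2.0, 3.0, 0.5]
print(pre_ln_block(x))   # residual preserved, unnormalized output
print(post_ln_block(x))  # output re-centered to zero mean every block
```

The question matters here because the muP transfer results in the paper and repo are demonstrated on pre-LN GPT-style architectures.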
@TevenLeScao has been working on GPT-2 with Megatron-DeepSpeed as well. @TevenLeScao, does anything here ring a bell for what the issue could be?
@thegregyang I found the problem: the default lr_scheduler configuration of my codebase broke the re-grouping of the optimizer's parameters.
A further question: if I intend to load an existing checkpoint before starting, does it make sense to re-normalize the matrix-like tensors to be zero-mean? Did you do any experiments dealing with such situations? (I saw in your paper that non-Gaussian initialization also slows down the process.)
OK, I'm glad you solved your issue.
We didn't do any experiments modifying pretrained networks. In general, I don't expect our understanding of training from scratch to carry over unmodified to training from a checkpoint (one trained without muP). So, in summary, I unfortunately have no advice for your situation.
I see. Thank you!
I'm trying mup on a deep transformer structure and have the following questions:
Specifically, say your full training run is 1M steps and you do hyperparameter tuning with 0.1M-step runs: how did you set the warmup ratio on each side? Also, I'd like to confirm that with a linear schedule, the learning rate decays to 0 at the end of each run (i.e., with a sharper decrease for the 0.1M-step training).
Help would be appreciated!
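The schedule described in the question can be sketched in plain Python (`linear_schedule` and its parameters are my names, not the repo's; this assumes warmup is specified as a ratio of total steps, matching the earlier advice that the ratio needn't be re-tuned): both the 0.1M-step tuning run and the 1M-step full run then warm up over the same fraction of training and decay linearly to exactly 0 at their final step, so the short run simply compresses the same shape, hence the sharper decrease.

```python
def linear_schedule(step, total_steps, warmup_ratio, peak_lr):
    """Linear warmup over warmup_ratio * total_steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

peak_lr = 3e-4
tuning_total, full_total = 100_000, 1_000_000  # 0.1M tuning run, 1M full run

# With the warmup *ratio* fixed, the lr matches in relative time: the value
# halfway through the short run equals the value halfway through the long run.
lr_tuning_mid = linear_schedule(tuning_total // 2, tuning_total, 0.01, peak_lr)
lr_full_mid = linear_schedule(full_total // 2, full_total, 0.01, peak_lr)
assert abs(lr_tuning_mid - lr_full_mid) < 1e-12

# Both runs decay to exactly 0 at their final step; the short run just
# traverses the same shape ten times faster.
assert linear_schedule(tuning_total, tuning_total, 0.01, peak_lr) == 0.0
assert linear_schedule(full_total, full_total, 0.01, peak_lr) == 0.0
```

Whether a fixed ratio or a fixed absolute warmup-step count transfers better is exactly the open question asked above; this sketch only illustrates the fixed-ratio convention.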