microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Questions on learning schedule and binary classification #40

Closed FlamingHorizon closed 1 year ago

FlamingHorizon commented 1 year ago

I'm trying mup on a deep transformer architecture and have the following questions:

  1. Warmup ratio: I tuned the learning rate alone at width = 256 and transferred the result to width = 512, but the training curve diverges. I ran a coord check and there seems to be no problem. When I then increased the warmup ratio from 0.01 to 0.1 for width = 512, the loss converges. Is the warmup ratio a hyperparameter that has to be re-tuned, or can it be mu-transferred across widths (apparently not, judging from the experiments above)? Do you have theoretical insights or practical experience on this? (A coord-check sketch is included at the end of this comment.)

Specifically, if the full training run is 1M steps and the hyperparameter tuning runs are 0.1M steps, how did you set the warmup ratio on each side? Also, I'd like to confirm that with a linear schedule the learning rate decays to 0 by the end of each run (i.e. with a steeper decrease for the 0.1M-step run); a minimal sketch of such a schedule is included after this list.

  2. Binary classification: If I have a binary classification head (linear + softmax) trained along with the language model and follow the practice of all-zero initialization for this head, the weights and logits of the two output neurons will always be "x and -x" throughout training, due to the gradient symmetry of binary softmax. I'm not sure whether this actually has bad effects, but I was wondering: is all-zero initialization necessary for a classification head in transformer pretraining?
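
For concreteness, here is a minimal sketch of the schedule described in question 1: linear warmup over warmup_ratio of the total steps, then linear decay to 0 by the end of the run. The model, learning rate, and step counts are placeholders rather than the settings from this issue (and with muP one would use mup.MuAdam instead of the plain optimizer).

```python
import torch

def linear_warmup_decay(warmup_ratio: float, total_steps: int):
    """lr multiplier: linear warmup for warmup_ratio * total_steps, then linear decay to 0."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    return lr_lambda

# A 0.1M-step tuning run and a 1M-step full run sharing the same warmup_ratio warm up
# and decay at the same *fraction* of training, i.e. 10x faster in absolute steps.
model = torch.nn.Linear(256, 256)                      # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)   # placeholder lr
sched = torch.optim.lr_scheduler.LambdaLR(opt, linear_warmup_decay(0.01, 100_000))
```

LambdaLR multiplies each param group's stored base lr by this factor, so it composes with muP's per-group learning-rate scaling.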

Help would be appreciated!
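
As an aside, the coord check mentioned in question 1 can be run with the helpers in mup.coord_check; the snippet below follows the pattern from the mup README, with MyTransformer, make_loader, and base_shapes.bsh as placeholders for my own model, data loader, and base-shapes file.

```python
from mup import set_base_shapes
from mup.coord_check import get_coord_data, plot_coord_data

# Lazily build muP models at several widths; any custom mup.init calls
# would also go inside the lambda, after set_base_shapes.
def lazy_model(width):
    return lambda: set_base_shapes(MyTransformer(width=width), 'base_shapes.bsh')

models = {w: lazy_model(w) for w in (128, 256, 512, 1024)}
dataloader = make_loader(batch_size=24)  # placeholder; a small batch is enough here

# Record activation statistics over a few training steps; under muP the
# per-layer curves should stay roughly flat as width grows.
df = get_coord_data(models, dataloader)
plot_coord_data(df, save_to='coord_check.png')
```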

edwardjhu commented 1 year ago

Hi!

  1. This sounds like a precision issue or a bug in your code. You shouldn't have to adjust the warmup ratio if you are training for the same number of steps. Are the batches ordered in the same way in these runs?

  2. It doesn't matter, since you can still model an arbitrary probability. Zero init can be helpful since it gets rid of the effect of the random initial output (the Gaussian-process-like behavior at initialization); see the sketch below.

Hope this helps!
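
A side note on point 2: since softmax([x, -x]) = (σ(2x), 1 − σ(2x)), the symmetric logits can still represent any probability, so the zero init costs nothing in expressiveness. Below is a minimal sketch of such a head using mup's MuReadout; the hidden size and class count are placeholders.

```python
from mup import MuReadout

d_model, num_classes = 512, 2  # placeholders

# MuReadout is mup's drop-in replacement for the output nn.Linear;
# readout_zero_init=True zeroes its weights so the initial output carries
# no random signal from initialization (the "initial GP" mentioned above).
head = MuReadout(d_model, num_classes, bias=True, readout_zero_init=True)
# As usual, set_base_shapes must still be called on the full model before training.
```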

FlamingHorizon commented 1 year ago

Thanks for your reply!

Yes, the batches are ordered in the same way. The models are trained in full float32. A "silly" question: did you mean that if a model converges at a small width, it is guaranteed to converge at a larger width if mu-transferred correctly?

thegregyang commented 1 year ago

Yes, the larger model should never do worse than the small model under muP, especially if the entire training procedure (number of steps, etc.) is fixed.

FlamingHorizon commented 1 year ago

I see, thanks. Below are some results I got yesterday for the training loss at 10k steps; I was tuning lr, output_mult, and initializer_range. It seems "the wider, the better" holds as long as the model does not diverge.

I'll double-check my code, but please also let me know if anything looks weird in this table!

[image: table of training loss at 10k steps across widths and hyperparameter settings]
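
For readers following along, the three swept knobs map onto the mup API roughly as in the sketch below, on a toy stand-in model; this assumes mup.init provides the normal_ wrapper described in the README, and all numeric values are placeholders.

```python
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes
from mup import init as mup_init

lr, output_mult, init_std = 1e-3, 1.0, 0.02  # the three swept values (placeholders)

def net(width):
    # toy stand-in for the transformer: one hidden matrix plus a muP readout
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                         MuReadout(width, 2, output_mult=output_mult))

model = set_base_shapes(net(512), net(64), delta=net(128))
mup_init.normal_(model[0].weight, mean=0.0, std=init_std)  # initializer_range sets the init std
opt = MuAdam(model.parameters(), lr=lr)                     # lr is rescaled per param group by muP
```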

thegregyang commented 1 year ago

Are you using low precision?

FlamingHorizon commented 1 year ago

No, the model was trained in float32

thegregyang commented 1 year ago

Is this a pre-layernorm transformer?

FlamingHorizon commented 1 year ago

Yes, it's essentially GPT-2 with large depth, and batch_size = 24 (too small?).

thegregyang commented 1 year ago

@TevenLeScao has been working on GPT-2 with Megatron-DeepSpeed as well. @TevenLeScao, does anything here ring a bell as to what the issue could be?

FlamingHorizon commented 1 year ago

@thegregyang I found the problem: the default lr_scheduler configuration in my codebase broke the optimizer's parameter re-grouping.
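
For anyone hitting the same thing: mup's optimizer wrappers split parameters into width-dependent groups with different learning rates, so a scheduler (or the config layer around it) that rebuilds param_groups or writes one flat lr into every group silently undoes the muP scaling. A sketch of the intended ordering, with a minimal muP module standing in for the real model and all hyperparameters as placeholders:

```python
import torch
from mup import MuAdam, MuReadout, set_base_shapes

# Minimal muP-parametrized module, just to get parameters with infshape attached;
# the real model here is the GPT-2-style transformer from this thread.
layer = set_base_shapes(MuReadout(512, 2), MuReadout(64, 2))

# MuAdam groups parameters by their infinite-width shape and scales each group's lr;
# this per-group scaling is exactly what a group-rebuilding scheduler config can clobber.
opt = MuAdam(layer.parameters(), lr=1e-3)

# LambdaLR records each group's lr at construction and only multiplies it by the
# schedule factor, so the muP scaling survives.
total_steps, warmup = 100_000, 1_000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(step / warmup, max(0.0, (total_steps - step) / (total_steps - warmup))))
```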

A further question: if I want to load an existing checkpoint before starting, does it make sense to re-normalize the matrix-like parameters to zero mean? Did you run experiments dealing with such situations? (I saw in your paper that non-Gaussian initialization also slows down training.)

thegregyang commented 1 year ago

OK I'm glad you solved your issue.

We didn't do any experiments modifying pretrained networks. In general, I don't expect our understanding of training from scratch to carry over unmodified to training from a checkpoint that was trained without muP. So, in summary, I unfortunately have no advice for your situation.

FlamingHorizon commented 1 year ago

I see. Thank you!