microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Embedding Multiplier for Transformer - Clarification #45

Closed sashaDoubov closed 1 year ago

sashaDoubov commented 1 year ago

Hi!

I really love the work in muP! I wanted to clarify one point from the GPT-3 sweep in Appendix F.4:


Does this multiplier correspond to scaling the output of the input embedding layer, i.e. `inputs_embeds = self.embed_multiplier * self.wte(input_ids)`?

I didn't notice this hyperparameter being set in either mutransformers or the Transformer example in this repo, but I saw that Appendix F.4 recommends a value of 10 for this parameter (and that it seems distinct from the output temperature of the unembedding layer).

Thanks.

thegregyang commented 1 year ago

Thanks for the kind words. Yes, that's correct. We didn't implement everything in those repos, to keep the code simple and easy to understand. But as you can see, implementing this multiplier is very easy.
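
For reference, a minimal sketch of what this looks like in PyTorch. The class and attribute names (`ToyEmbedding`, `embed_multiplier`, `wte`) are hypothetical, chosen to match the snippet in the question; the multiplier is simply a scalar applied to the embedding output before the first transformer block:

```python
import torch
import torch.nn as nn

class ToyEmbedding(nn.Module):
    """Token embedding with a muP-style embedding multiplier (hypothetical
    names; the multiplier value of 10 follows the Appendix F.4 sweep)."""

    def __init__(self, vocab_size: int, d_model: int, embed_multiplier: float = 10.0):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        self.embed_multiplier = embed_multiplier  # tunable hyperparameter

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Scale the embedding output by the multiplier.
        return self.embed_multiplier * self.wte(input_ids)

emb = ToyEmbedding(vocab_size=100, d_model=16)
input_ids = torch.tensor([[1, 2, 3]])
inputs_embeds = emb(input_ids)
print(inputs_embeds.shape)  # torch.Size([1, 3, 16])
```

Note this multiplier on the *input* embedding is separate from the output-logit temperature on the unembedding layer; the two are tuned independently.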

sashaDoubov commented 1 year ago

Thanks for the fast answer! That makes sense.