Open codedecde opened 1 year ago
The position embedding maps to an infinite dimension (config.hidden_size). Why do you say it's finite?
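For concreteness, here is a minimal sketch of that point (my own illustration, not code from the repo): both the word embedding and the position embedding map a finite index set into the width dimension (config.hidden_size), which is the dimension that grows under muP, so both stay plain nn.Embedding layers and neither becomes a MuReadout.

import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    # illustrative class name; `config` is assumed to carry the usual fields
    def __init__(self, config):
        super().__init__()
        # finite vocab_size -> width (config.hidden_size): input-like, plain nn.Embedding
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        # finite max_position_embeddings -> width (config.hidden_size): same treatment,
        # a plain nn.Embedding rather than a MuReadout
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)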
Yes, the routing gate should be a MuReadout.
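A minimal sketch of what that could look like (assuming the mup package's MuReadout; the Router class and its arguments are hypothetical, not from this repo): the gate maps the width dimension (hidden_size) down to a fixed number of experts, so, like an output head, it is readout-like.

import torch.nn as nn
from mup import MuReadout

class Router(nn.Module):
    # hypothetical MoE router for illustration; num_experts is assumed to stay
    # fixed while hidden_size (the width) grows under muP
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # width (hidden_size) -> finite num_experts: output-like, so a MuReadout
        self.gate = MuReadout(hidden_size, num_experts)

    def forward(self, hidden_states):
        # routing logits over the fixed set of experts
        return self.gate(hidden_states)

As with any muP model, mup.set_base_shapes(...) would still need to be called on the full model for the MuReadout scaling to take effect.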
Thank you! I meant that the sequence-length side is finite (similar to the vocab size)?
Duplicate of a question asked on the mutransformers repository (link)
Hi! I was wondering whether (learned) positional embeddings should be MuReadout layers, since they map to a finite-dimensional space. Specifically:
https://github.com/microsoft/mutransformers/blob/480287ce7b18a07a3432e8f2fbc0f0e5b71e2599/mutransformers/models/bert/modeling_bert.py#L174
In addition to that, did you try using muP for sparse MoE models? I am curious about any findings for those. Specifically, I was wondering whether the routing gate (hdim, num_experts) would also be a MuReadout layer (if we don't scale the number of experts).
Would be grateful for any advice :)
Thank you!