Closed windspirit95 closed 2 years ago
Hi @windspirit95, thanks for your interest!
As a general rule of thumb, you only need to replace the nn.Linear layers that go from the model's hidden representation space to the "label" space. These layers are typically the last ones applied before the loss function. Usually, this is easy to tell from the model's forward function. However, here, from a cursory read of the model's __init__, it seems that the nn.Linear layers that map from *_embed_dim to final_dim are the ones that need to be replaced by mup.MuReadout: self.final_proj and self.project_q. That said, there are a few lines that seem to mix the hidden dimension with the output dimension, e.g.
final_dim = cfg.final_dim if cfg.final_dim > 0 else cfg.encoder_embed_dim
and it's not clear to me why this is done.
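To make the rule of thumb concrete, here is a minimal, self-contained sketch of the kind of substitution being described. ToyWav2Vec2Like is a hypothetical stand-in that only mimics the wav2vec2 pattern (hidden-to-hidden projections stay nn.Linear, hidden-to-"label"-space projections become MuReadout); it is not the actual fairseq class, and the dimensions are placeholders.

```python
# Minimal sketch, assuming mup and torch are installed; ToyWav2Vec2Like is a
# hypothetical toy model, not the real fairseq Wav2Vec2Model.
import torch.nn as nn
from mup import MuReadout, set_base_shapes

class ToyWav2Vec2Like(nn.Module):
    def __init__(self, encoder_embed_dim=768, final_dim=256):
        super().__init__()
        # hidden-to-hidden projection: keep nn.Linear (both sides scale with width)
        self.post_extract_proj = nn.Linear(512, encoder_embed_dim)
        # hidden -> target/"label" space projections: these are the MuReadout candidates
        self.final_proj = MuReadout(encoder_embed_dim, final_dim)
        self.project_q = MuReadout(encoder_embed_dim, final_dim)

# muP needs base shapes: compare the target-width model against a narrow base model
base = ToyWav2Vec2Like(encoder_embed_dim=64)
model = ToyWav2Vec2Like(encoder_embed_dim=768)
set_base_shapes(model, base)
```

The point of the sketch is only which layers change class: everything that keeps width on both sides stays nn.Linear, while the final projections into the (fixed-size) contrastive/target space become MuReadout.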
Hi @thegregyang. Thanks for your reply. The example I took is from https://github.com/pytorch/fairseq/blob/main/fairseq/models/wav2vec/wav2vec2.py, starting around line 294. This model looks a bit like the Transformer example, doesn't it? But it is quite difficult to apply muP to this model :)
Looking at the source code, it seems that my initial assessment is correct. You just need to replace self.final_proj and self.project_q. You can double-check this by running a coord check and ensuring that the curves are flat.
I'll close this issue for now since it seems like we have arrived at an answer. But feel free to re-open if there are further problems.
Hi, your project is really interesting, so I am learning how to apply it to some specific models. For example, when a model has multiple nn.Linear layers, as in wav2vec 2.0 (self.post_extract_proj, self.project_q, self.project_inp, self.target_glu, self.final_proj), should I replace all of these layers with MuReadout?
Thank you! ^^