Closed dreavjr closed 8 months ago
Okay, I think I finally got it!
I cannot simply apply mup to the individual parameters of a vanilla model/layer/block and expect it to work every time: sometimes the model/layer/block has to be reparameterized. In particular, all layers in an MLP-like block have to grow or shrink in tandem, except, possibly, for the output layer of the model.
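For anyone hitting the same thing, here is a minimal sketch of how I now picture it. It does not use the mup package at all: `classify` is a hypothetical helper I made up, and it assumes that a dimension counts as "infinite" exactly when it differs between the base model and the wider target model, which is my understanding of how the base shapes are compared. Shapes follow the PyTorch `(fan_out, fan_in)` convention, and the layer names and widths are invented for illustration.

```python
# Hypothetical helper, NOT part of the mup package: it only mimics the
# width bookkeeping behind the assertion, using plain (fan_out, fan_in) tuples.

def classify(base_shapes, target_shapes):
    """Label each weight by which of its dims change between base and target."""
    report = {}
    for name, (base_out, base_in) in base_shapes.items():
        out, in_ = target_shapes[name]
        inf_out = out != base_out  # fan-out scales with width
        inf_in = in_ != base_in    # fan-in scales with width
        if inf_in and inf_out:
            kind = "hidden"
        elif inf_out:
            kind = "input-like"
        elif inf_in:
            # infinite fan-in, finite fan-out: the case the assertion complains about
            kind = "readout-like"
        else:
            kind = "finite"
        report[name] = kind
    return report


# MLP block whose inner width stays at 64 while the model width grows 128 -> 512.
fixed_inner_base = {"fc1": (64, 128), "fc2": (128, 64)}
fixed_inner_wide = {"fc1": (64, 512), "fc2": (512, 64)}

# Same block, but the inner width is a multiple of the model width (grows in tandem).
tandem_base = {"fc1": (512, 128), "fc2": (128, 512)}
tandem_wide = {"fc1": (2048, 512), "fc2": (512, 2048)}

print(classify(fixed_inner_base, fixed_inner_wide))
print(classify(tandem_base, tandem_wide))
```

In the first block `fc1` ends up "readout-like" even though it sits in the middle of the model, which matches the "non-obvious" situations I was seeing. Once the inner width scales with the model width, both layers are plain hidden layers.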
I am closing this for now.
My code is triggering the "has infinite fan-in and finite fan-out dimensions but is not type `MuReadout`" assertion on "non-obvious" situations (not the last linear layer of the model):
What am I doing wrong? Is there a good way to debug those situations?