Open AkshitaB opened 2 months ago
Hi @AkshitaB, I'm reproducing muP too these days. Can you share the architecture? Or have you solved the problem?
@AkshitaB (very delayed reply but still might be helpful)
In my experience, query/readout zero-init didn't help either. What I did see is that the readout norms grow during the early iterations but stabilise across widths after enough iterations (around 30). Your plot for t=4 may already hint at this, so running the coordinate check for more steps might flatten your readout norms.
But even if it doesn't, this has never stopped muTransfer from working for me in practice. What matters most is that the other layers' norms look flat, which is the case for you :)
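If it helps to put a number on "flat", here is a minimal sketch that fits, for each step, the slope of log(l1) against log(width). It is plain Python and does not use the mup API. The `width_slopes` helper and the `(width, step, l1)` record layout are my own assumptions, so adapt them to however you export your coordinate-check data. A slope near zero at a given step means the norm is roughly width-independent there.

```python
import math
from collections import defaultdict


def width_slopes(records):
    """Least-squares slope of log(l1) vs log(width), computed per step.

    `records` is an iterable of (width, step, l1) tuples (hypothetical layout).
    Non-positive l1 values are skipped, since a zero-initialised readout gives
    exactly zero output before the first update and log(0) is undefined.
    """
    by_step = defaultdict(list)
    for width, step, l1 in records:
        if l1 > 0:
            by_step[step].append((math.log(width), math.log(l1)))

    slopes = {}
    for step, pts in sorted(by_step.items()):
        n = len(pts)
        mean_x = sum(x for x, _ in pts) / n
        mean_y = sum(y for _, y in pts) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pts)
        if sxx == 0:
            continue  # need at least two distinct widths to fit a slope
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
        slopes[step] = sxy / sxx
    return slopes


# Synthetic numbers only, to show the call; not measurements from any model.
demo = [(w, t, w ** (0.5 / t)) for w in (256, 512, 1024) for t in (1, 2, 4)]
for step, slope in width_slopes(demo).items():
    print(f"step {step}: slope {slope:.3f}")
```

Running this on the readout alone over a longer coordinate check would show whether its slope actually shrinks toward zero with more steps, or stays positive.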
I'm implementing muP for the OLMo model, and am facing an issue with the coordinate check.
The increasing l1 is for the network output. Following the docs, I also set readout init and query init to zero. I also ensure that the initialization is applied after `set_base_shapes` is called. What other things can I check to debug the issue?