microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Increasing coord check for the network output #71

Open AkshitaB opened 2 months ago

AkshitaB commented 2 months ago

I'm implementing muP for the OLMo model, and am facing an issue with the coordinate check.

[Figures: coordinate check plots "sp_trsfmr_adamw_coord" and "μp_trsfmr_adamw_coord"]

The curve with the increasing l1 is the network output. Following the docs, I set both the readout init and the query init to zero. I also made sure that the initialization is applied after set_base_shapes is called.
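For context, a short note on what is being plotted. This assumes the plots come from mup's coordinate check helper, where, as far as I understand it, l1 is the mean absolute value of a layer's output coordinates. The symbols below (x for the layer output, n for the number of coordinates, t for the training step) are notation introduced here, not taken from the thread.

```latex
% l1 statistic of a layer output x with n coordinates, at training step t
\ell_1\bigl(x_t\bigr) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl|x_{t,i}\bigr|

% What a passing coordinate check looks like under \mu P: for each fixed step t,
% hidden-layer curves stay roughly flat as the width grows, and the network
% output does not blow up with width.
\ell_1\bigl(x_t^{\text{hidden}}\bigr) = \Theta(1), \qquad
\ell_1\bigl(x_t^{\text{output}}\bigr) = O(1) \quad \text{as width} \to \infty
```

With a zero-initialized readout the output is exactly zero at the first forward pass, so the output curve can only be judged from later steps.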

What other things can I check to debug the issue?

SeunghyunSEO commented 2 weeks ago

Hi @AkshitaB, I'm reproducing muP too these days. Can you share the architecture? Or have you already solved the problem?

ofivite commented 2 weeks ago

@AkshitaB (very delayed reply but still might be helpful)

In my experience, query/readout zero-init did not help either. What I did see is that the readout norms grow during the early iterations but stabilise across widths after a sufficient number of iterations (around 30). Your plot may already hint at this at t=4, so running the coordinate check for more steps might flatten your readout norms.
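One way to make "flattening" less of an eyeball judgment is to fit, for each step, the slope of log(l1) against log(width); a slope near zero means the curve is flat across widths. The following is a minimal sketch using only the standard library. The function names and the `(step, width, l1)` record layout are hypothetical, not part of mup, and the records at the bottom are synthetic numbers made up purely to exercise the code, not measurements from any model.

```python
import math
from collections import defaultdict


def loglog_slope(widths, values):
    """Least-squares slope of log(value) vs log(width).

    Non-positive values are skipped (a zero-initialized readout gives an
    exactly zero output at the first step, which has no logarithm).
    Returns None if fewer than two usable points remain.
    """
    pts = [(math.log(w), math.log(v)) for w, v in zip(widths, values) if v > 0]
    if len(pts) < 2:
        return None
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        return None  # all widths identical, slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


def slopes_by_step(records):
    """Group (step, width, l1) records by step and fit one slope per step."""
    by_step = defaultdict(list)
    for step, width, l1 in records:
        by_step[step].append((width, l1))
    result = {}
    for step in sorted(by_step):
        pairs = sorted(by_step[step])
        result[step] = loglog_slope([w for w, _ in pairs], [v for _, v in pairs])
    return result


# Synthetic illustration only: l1 = width ** (0.5 / step), so the width
# dependence weakens as the step count grows.
records = [(t, w, w ** (0.5 / t)) for t in (1, 2, 4, 8) for w in (256, 512, 1024, 2048)]
slopes = slopes_by_step(records)
for step, slope in slopes.items():
    print(f"step {step}: slope {slope:.4f}")
```

If the per-step slope for the output layer shrinks toward zero as the number of steps increases, that would be consistent with the stabilisation described above; a slope that stays clearly positive at late steps would suggest a real width dependence.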

But even if not, this has never stopped muTransfer from working for me in practice. What matters most is that the other layers' norms look flat, which is the case for you :)