Great work! This is the first study I've seen that focuses on making the grokking phenomenon practical.

What is a good strategy for choosing the AdamW weight decay value for a new model and dataset? The paper seems to use a very wide range of values. Is there an approach you use to narrow down the values that work, instead of doing a full hyperparameter search (which is costly if one has to wait for grokking to happen)?

Thanks for your recognition!

From my experience, I believe the rule of thumb can be summarized as:

However, this is just a recommendation, not based on a rigorous study (which may be a great topic for future work). Hope you find this useful!
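For what it's worth, here is a minimal sketch of what such a narrowing-down search could look like: a coarse log-spaced sweep with a short step budget per candidate, then a finer grid around the best coarse value. The toy model, data, and step budget below are placeholders standing in for a real grokking task, not values from the paper.

```python
import torch
import torch.nn as nn

def short_run(weight_decay, steps=500):
    """Train a tiny model briefly and report validation accuracy.
    Placeholder task: random data standing in for a real grokking setup."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
                            weight_decay=weight_decay)
    xtr, ytr = torch.randn(256, 16), torch.randint(0, 2, (256,))
    xva, yva = torch.randn(256, 16), torch.randint(0, 2, (256,))
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(xtr), ytr).backward()
        opt.step()
    with torch.no_grad():
        return (model(xva).argmax(-1) == yva).float().mean().item()

# Coarse grid spanning the wide range of values used across tasks.
coarse = [10 ** k for k in range(-3, 2)]          # 1e-3 ... 1e1
scores = {wd: short_run(wd) for wd in coarse}
best = max(scores, key=scores.get)

# Refine around the best coarse value before committing to long runs.
fine = [best * f for f in (0.3, 0.5, 1.0, 2.0, 3.0)]
```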
Do you have a take on specific learning rates, as proposed in *Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization*?
I stuck to the default learning rate for each task (1e-3 to 1e-4) used in the paper. Empirically, I found that the maximum stable learning rates of the Grokfast-augmented optimizers and the non-Grokfast baseline optimizers are similar.

However, if you want to venture further: since Grokfast adds smoothed gradients, it tends to stabilize training slightly, so you may increase the learning rate up to 1.5x to 2.0x that of the baseline models (as I have tested).
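For reference, here is a self-contained sketch of the idea: an EMA low-pass filter over the gradients whose output is amplified and added back before the optimizer step, combined with a learning rate raised to 1.5x the baseline. The filter below illustrates the mechanism rather than copying the repo's grokfast.py verbatim; the model, data, `alpha`, `lamb`, and weight decay values are placeholders.

```python
import torch
import torch.nn as nn

def gradfilter_ema(model, grads=None, alpha=0.98, lamb=2.0):
    """Keep an EMA (low-pass filter) of each parameter's gradient and add
    an amplified copy of it back onto the instantaneous gradient."""
    if grads is None:
        # Initialize the EMA state from the current gradients.
        grads = {n: p.grad.detach().clone()
                 for n, p in model.named_parameters() if p.grad is not None}
    for n, p in model.named_parameters():
        if p.grad is not None:
            grads[n].mul_(alpha).add_(p.grad.detach(), alpha=1 - alpha)
            p.grad.add_(grads[n], alpha=lamb)
    return grads

# Toy task standing in for a real grokking setup.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()

base_lr = 1e-3                                    # the task's default LR
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1.5 * base_lr,   # ~1.5x the baseline
                              weight_decay=0.01)

grads = None
for step in range(100):
    x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    grads = gradfilter_ema(model, grads)  # filter gradients before stepping
    optimizer.step()
```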