ironjr / grokfast

Official repository for the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients"
https://arxiv.org/abs/2405.20233
MIT License

Choosing weight decay? #1

Closed TKassis closed 2 months ago

TKassis commented 2 months ago

Great work! This is the first study I've seen that focuses on making the phenomenon practical.

What is a good strategy for choosing the AdamW weight decay value for a new model and dataset? The paper seems to use a very large range of values. Is there an approach you use to narrow down the candidate values instead of doing full hyperparameter tuning (which is costly if one has to wait for grokking to happen)?

ironjr commented 2 months ago

Thanks for your recognition!

From my experience, the rule of thumb can be summarized as:

  1. Start from the default weight decay for the task, for example, the value chosen by the most widely used GitHub repository for that task.
  2. Keep the weight decay fixed and first find a good setting for the Grokfast filter parameters (momentum, window size, and amplitude); a minimal sketch of this loop follows the list. Although weight decay does affect the optimal filter parameters, its effect has been insignificant in my experience.
  3. Then start increasing the weight decay: from the default (1x), try 2x, 5x, and 10x. I could not get better results at 100x the default value.
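Here is a minimal sketch of steps 1 and 2 (with step 3 noted as a comment), assuming the `gradfilter_ema` helper from this repository's `grokfast.py`; the toy model, data, and the `alpha`/`lamb` values below are placeholders, so check the actual file for the exact interface and defaults.

```python
import torch
from grokfast import gradfilter_ema  # helper from this repository (see grokfast.py)

# Toy stand-ins so the sketch is self-contained; substitute your own model and data.
model = torch.nn.Linear(16, 2)
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
loss_fn = torch.nn.CrossEntropyLoss()

# Step 1: keep the task's default weight decay fixed at first (0.01 is a placeholder).
# Step 3 (later): once the filter works, retry with weight_decay at 2x, 5x, 10x the default.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Step 2: sweep the filter parameters (EMA momentum alpha, amplification lamb)
# while the weight decay stays fixed.
alpha, lamb = 0.98, 2.0

grads = None
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Amplify the slow (low-frequency) gradient component before the parameter update.
    grads = gradfilter_ema(model, grads=grads, alpha=alpha, lamb=lamb)
    optimizer.step()
```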

However, this is just a recommendation, not the result of a rigorous study (which may be a great topic for future work). Hope you find this useful!

l4b4r4b4b4 commented 2 months ago

Do you have a take on specific learning rates, as proposed in "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization"?

ironjr commented 2 months ago

I stuck to the default learning rate of each task (1e-3~1e-4) used in the paper. Empirically, I found that the maximum stable learning rate is similar for the Grokfast-augmented optimizers and the non-Grokfast baseline optimizers.

However, if you want to venture further: since Grokfast adds smoothed gradients, it tends to stabilize training slightly, so you may be able to increase the learning rate up to 1.5x~2.0x of the baseline models (as far as I have tested).
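For illustration, a hedged sketch of that adjustment, assuming a task whose baseline learning rate is 1e-3; the 1.5x multiplier and the weight decay are example values within the range mentioned above, not recommendations.

```python
import torch

model = torch.nn.Linear(16, 2)  # toy stand-in for your task's model
base_lr = 1e-3                  # the task's default learning rate

# With Grokfast's smoothed gradients, training tends to be slightly more stable,
# so a learning rate of roughly 1.5x to 2.0x the baseline may still be usable.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5 * base_lr, weight_decay=0.01)
```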