ironjr / grokfast

Official repository for the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients"
https://arxiv.org/abs/2405.20233
MIT License

Is this specific to transformers? #11

Closed. phalexo closed this issue 1 month ago

phalexo commented 2 months ago

I think the original article first discovered the grokking effect in transformers.

I have been experimenting with a seq2seq model for language translation, and I am not seeing any behavior that would indicate a state transition on the validation data.

Zhi0467 commented 1 month ago

I've tested it on a two-layer diagonal MLP for a classification task that exhibits grokking, and Grokfast does mitigate the delayed generalization.

ironjr commented 1 month ago

Thank you very much for trying out our code. I would like to gently note that the code provided here is primarily a proof-of-concept demonstration of grokking acceleration in previously known scenarios, so the filter design may be suboptimal in other settings.

As mentioned in our paper, Transformers, MLPs, and LSTMs under the grokking phenomenon can benefit from Grokfast if a well-designed low-pass filter is applied. However, as you may have already noticed, MLPs and LSTMs require strong weight-norm regularization to see this benefit. So weight norms (as thoroughly investigated in the Omnigrok paper) and low-pass-filtered gradients seem to provide a synergistic effect when used together. This effect deserves further investigation.
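As a rough illustration of that combination, here is a minimal PyTorch sketch of EMA-style gradient low-pass filtering used alongside strong AdamW weight decay. The helper name `ema_gradfilter`, the placeholder model and data, and the hyperparameter defaults are illustrative assumptions for this sketch, not necessarily the exact API or settings of this repository.

```python
import torch
import torch.nn as nn
from typing import Dict, Optional

def ema_gradfilter(model: nn.Module,
                   grads: Optional[Dict[str, torch.Tensor]],
                   alpha: float = 0.98,
                   lamb: float = 2.0) -> Dict[str, torch.Tensor]:
    """Keep an EMA of each parameter's gradient (the slow component) and amplify it."""
    if grads is None:
        # Initialize the EMA state with the first observed gradients.
        grads = {n: p.grad.detach().clone()
                 for n, p in model.named_parameters() if p.grad is not None}
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        grads[n] = alpha * grads[n] + (1.0 - alpha) * p.grad.detach()  # low-pass filter
        p.grad = p.grad + lamb * grads[n]                              # amplify slow gradients
    return grads

# Training loop: strong weight decay (AdamW) together with the filtered gradients.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)   # strong weight decay
grads = None
for step in range(1000):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))                   # placeholder batch
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    grads = ema_gradfilter(model, grads)   # filter after backward(), before step()
    optimizer.step()
```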

We are at a very early stage of designing a good optimizer for models under grokking, so I believe there should be better filter designs for different types of tasks and models. In other words, the MA/EMA filters shown here are only a proof of concept, and further research is needed to find good filter designs beyond these simplest options. Again, thanks for the valuable report.
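For completeness, a similarly hedged sketch of the other proof-of-concept option, a windowed moving-average (MA) gradient filter. The function name `ma_gradfilter` and its defaults are illustrative only; the repository's actual helpers may differ.

```python
from collections import deque
import torch
import torch.nn as nn
from typing import Dict, Optional

def ma_gradfilter(model: nn.Module,
                  grads: Optional[Dict[str, deque]],
                  window_size: int = 100,
                  lamb: float = 5.0) -> Dict[str, deque]:
    """Average each parameter's gradient over a sliding window and amplify that slow component."""
    if grads is None:
        # One fixed-length window of past gradients per parameter.
        grads = {n: deque(maxlen=window_size)
                 for n, p in model.named_parameters() if p.requires_grad}
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        grads[n].append(p.grad.detach().clone())
        if len(grads[n]) == window_size:               # amplify only once the window is full
            slow = torch.stack(tuple(grads[n])).mean(dim=0)
            p.grad = p.grad + lamb * slow
    return grads
```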