ironjr / grokfast

Official repository for the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients"
https://arxiv.org/abs/2405.20233
MIT License

AdamW better than grokfast + Adam? #12

Closed · Zhi0467 closed this issue 1 month ago

Zhi0467 commented 1 month ago

Hi! I've been playing around with the code for days and noticed an interesting phenomenon that I hope someone can help me understand: AdamW seems to be better than Grokfast + Adam in many cases. Here are the details of my two experiments:

  1. I experimented with main.py on the following two setups: NO3 is AdamW with a small weight decay of 0.01, and NO4 is EMA Grokfast with Adam and the same small weight decay. [screenshot: experiment 1 setups, 2024-07-10]

The results: NO3 shows grokking as expected but eventually generalizes: [plot: acc_NO3_none_wd10e-02_lrx4_optimizerAdamW_start_at1]

However, NO4 fails: [plot: acc_NO4_ema_a0990_l5_wd10e-02_lrx4_optimizerAdam_start_at1]

  2. This time I changed the task to learning x^2 + xy + y^2 instead of simple multiplication, and changed p from 97 to 113. Here are the setups: [screenshots: experiment 2 setups, 2024-07-10]

Results: AdamW + small weight decay shows (not very pronounced) grokking as expected, and it eventually reaches the validation accuracy predicted in the original OpenAI grokking paper: [plot: acc_test_none_wd50e-03_lrx4]

Again, Grokfast + Adam + small weight decay fails to learn the task: [plot: acc_test_ema_a0980_l2_wd50e-03_lrx4]

So why is it the case that AdamW seems to be better than Grokfast + Adam in many cases?
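
For context, "Grokfast + Adam" in these runs amounts to keeping an EMA of the gradients and adding that slow component back, scaled by lambda, before each optimizer step. Below is a minimal PyTorch sketch of that EMA filtering with the NO4-style settings (alpha = 0.99, lambda = 5.0, weight decay = 0.01); the helper name `gradfilter_ema`, the toy model, and the loop are illustrative assumptions rather than a verbatim copy of the repository code:

```python
import torch

def gradfilter_ema(model, grads=None, alpha=0.98, lamb=2.0):
    """EMA low-pass filter over per-parameter gradients.

    Keeps a running EMA of each parameter's gradient and adds `lamb` times
    that slow component back onto the raw gradient before the optimizer step.
    """
    if grads is None:
        # Initialize the EMA with the first observed gradients.
        grads = {n: p.grad.detach().clone() for n, p in model.named_parameters()
                 if p.requires_grad and p.grad is not None}
    for n, p in model.named_parameters():
        if p.requires_grad and p.grad is not None:
            grads[n] = alpha * grads[n] + (1 - alpha) * p.grad.detach()
            p.grad = p.grad + lamb * grads[n]   # amplify the slow component
    return grads

# "Grokfast + Adam" (NO4-style run): filter gradients, then a plain Adam step.
model = torch.nn.Linear(128, 128)        # stand-in for the transformer in main.py
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
grads = None
for step in range(10):                   # stand-in for the real training loop
    x = torch.randn(32, 128)
    loss = model(x).pow(2).mean()        # dummy loss, just to produce gradients
    optimizer.zero_grad()
    loss.backward()
    grads = gradfilter_ema(model, grads=grads, alpha=0.99, lamb=5.0)
    optimizer.step()
```

The NO3 baseline is then simply the same loop with `torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)` and the `gradfilter_ema` call removed.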

ironjr commented 1 month ago

Thank you for the valuable report. I highly appreciate your effort in experimenting with Grokfast. I will check this with my own code as well.

Please note that the paper and the provided code are a proof of concept: they show that acceleration of the grokking phenomenon is observable in previously known grokking scenarios when specifically modulated gradient filters are applied. However, as you may have noticed, practical use of such techniques still requires better filter design (e.g., the window size for MA, alpha and weight decay for EMA, or other types of low-pass filters) so that practitioners can get by with minimal tuning effort. Your report seems to imply that the filter parameters are currently sensitive to the task at hand. More detailed research into better filter design is needed for ease-of-use applications.

I will investigate this further and keep you posted. Thank you again for the report.
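
For concreteness, the MA variant mentioned above replaces the EMA with a fixed-length window of past gradients and only kicks in once the window is full. A rough sketch in the same spirit; the helper name, defaults, and trigger condition are illustrative rather than the repository's exact implementation:

```python
from collections import deque
import torch

def gradfilter_ma(model, grads=None, window_size=100, lamb=5.0):
    """Windowed moving-average filter over per-parameter gradients.

    Keeps the last `window_size` gradients per parameter and, once the
    window is full, adds `lamb` times their mean onto the current gradient.
    """
    if grads is None:
        grads = {n: deque(maxlen=window_size) for n, p in model.named_parameters()
                 if p.requires_grad}
    for n, p in model.named_parameters():
        if p.requires_grad and p.grad is not None:
            grads[n].append(p.grad.detach().clone())
            if len(grads[n]) == window_size:   # amplify only after the window fills
                avg = torch.stack(list(grads[n])).mean(dim=0)
                p.grad = p.grad + lamb * avg
    return grads
```

The window size plays the same role for MA that alpha plays for EMA: it sets the cutoff of the low-pass filter, which is exactly the kind of knob that currently seems to need per-task tuning.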

Zhi0467 commented 1 month ago

Thanks, and please keep me updated! I forgot to point out that the alpha and lambda I used for experiment 1 (NO4) were 0.99 and 5.0 instead of the suggested 0.98 and 2.0, so, as you mentioned, the method is currently sensitive to parameter tuning. For the second experiment I did use 0.98 and 2.0, but its failure implies that either the parameters need to be tuned for each specific task (xy vs. x^2 + xy + y^2), or that the current Grokfast struggles with harder tasks. Anyway, I'll keep looking into this and hope to see more comprehensive experiments!