Closed: Zhi0467 closed this issue 1 month ago
Thank you for the valuable report. I greatly appreciate your effort in experimenting with Grokfast, and I will check with my own code as well.
Please note that the paper and the provided code are a proof of concept: they show that acceleration of the grokking phenomenon is observable in previously known grokking scenarios when specifically modulated gradient filters are applied. However, as you may have noticed, practical use of such techniques requires careful filter design (e.g., the window size for the MA filter, alpha and weight decay for the EMA filter, or other types of low-pass filters) with minimal tuning effort for practitioners. Your report seems to imply that the filter parameters are currently sensitive to the task at hand. More detailed research on filter design is needed before this is easy to apply in practice.
I will investigate this further and keep you posted. Thank you again for the report.
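For readers unfamiliar with the filter being discussed, here is a minimal, hedged sketch of an EMA-based gradient low-pass filter in the spirit of Grokfast (not the authors' exact implementation; the function name `ema_filter` and the scalar formulation are illustrative assumptions). The idea is to track an exponential moving average of each gradient and add the amplified slow component back to the instantaneous gradient before the optimizer step:

```python
def ema_filter(grad, ema, alpha=0.98, lamb=2.0):
    """Illustrative scalar version of an EMA gradient low-pass filter.

    alpha: EMA decay; larger alpha keeps a slower (lower-frequency) component.
    lamb:  amplification applied to the slow component ("lambda" in the thread).
    Returns (filtered_grad, updated_ema).
    """
    # Update the running average of the gradient (the "slow" component).
    new_ema = alpha * ema + (1.0 - alpha) * grad
    # Amplify the slow component and add it back to the raw gradient.
    return grad + lamb * new_ema, new_ema


# Usage: carry one EMA state per parameter across training steps.
ema = 0.0
for raw_grad in [1.0, 0.9, 1.1]:
    filtered, ema = ema_filter(raw_grad, ema, alpha=0.98, lamb=2.0)
    # `filtered` would then be handed to the optimizer in place of raw_grad.
```

The sensitivity discussed in this thread lives in `alpha` and `lamb`: alpha sets the filter's cutoff frequency and lamb how strongly the slow gradients are boosted, which is why 0.98/2.0 and 0.99/5.0 can behave so differently per task.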
Thanks, and please keep me updated! I forgot to point out that the alpha and lambda I used for experiment 1 - NO4 are 0.99 and 5.0 instead of the suggested 0.98 and 2.0, so, as you mentioned, the method is currently sensitive to parameter tuning. For the second experiment I used 0.98 and 2.0, but the failure implies that either the parameters need to be tuned for each specific task (xy or x^2 + xy + y^2), or current Grokfast struggles with harder tasks. Anyway, I'll keep looking into this and hope to see more comprehensive experiments!
Hi! I've been playing around with the code for days and noticed an interesting phenomenon that I hope someone can help me understand: plain AdamW often seems to outperform Grokfast + Adam. Here are the details of my two experiments:
The results: NO3 shows delayed generalization (grokking) as expected, and eventually generalizes:
However, NO4 fails:
Results: AdamW with a small weight decay shows (not too pronounced) grokking as expected, and it finally reaches the expected validation accuracy (as predicted in the original OpenAI grokking paper).
Again, Grokfast + Adam with a small weight decay fails to learn the task:
So why does AdamW seem to outperform Grokfast + Adam in so many cases?