lucidrains / grokfast-pytorch

Explorations into the proposal from the paper "Grokfast, Accelerated Grokking by Amplifying Slow Gradients"
MIT License

Different math? #4

Open brockbrownwork opened 6 days ago

brockbrownwork commented 6 days ago

I'm seeing a pretty significant difference between the loss plots of this implementation and those of the official implementation. This one has trouble converging in my use case (though I did not run it for very long). The math may be different.

lucidrains commented 5 days ago

@brockbrownwork it could be because i normalized the learning rate in this repository, for a fair comparison: https://github.com/lucidrains/grokfast-pytorch/blob/main/grokfast_pytorch/grokfast.py#L26

lucidrains commented 5 days ago

@brockbrownwork try setting normalize_lr to False and it should be equivalent, i hope
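
For readers following along, here is a minimal sketch of why a learning rate normalization flag could change the loss curve. It is not the repository's code. It assumes, without confirmation from the thread, that "normalizing" means dividing the learning rate by (1 + lamb) to offset the amplified slow gradient, and it uses plain SGD on a single scalar instead of AdamW. The function name grokfast_sgd_updates and its parameters are hypothetical, chosen only to mirror the normalize_lr flag mentioned above.

```python
# Hypothetical illustration only: a scalar Grokfast-style EMA filter with plain SGD.
# ASSUMPTION: normalize_lr divides the learning rate by (1 + lamb). The thread does
# not state the exact formula used in the repository.

def grokfast_sgd_updates(grads, lr=0.1, alpha=0.98, lamb=2.0, normalize_lr=True):
    """Return the per-step parameter updates for a sequence of scalar gradients."""
    # With normalization, the step size is shrunk to offset the amplification below.
    step_lr = lr / (1.0 + lamb) if normalize_lr else lr

    ema = None
    updates = []
    for g in grads:
        # Exponential moving average of gradients (the "slow" component).
        ema = g if ema is None else alpha * ema + (1.0 - alpha) * g
        # Grokfast-style amplification: add the slow component back, scaled by lamb.
        amplified = g + lamb * ema
        updates.append(-step_lr * amplified)
    return updates


constant_grads = [1.0, 1.0, 1.0]
normalized = grokfast_sgd_updates(constant_grads, normalize_lr=True)
unnormalized = grokfast_sgd_updates(constant_grads, normalize_lr=False)

# Under this assumption the unnormalized steps are (1 + lamb) times larger.
ratio = unnormalized[0] / normalized[0]
```

Under that assumption, the same nominal learning rate produces steps (1 + lamb) times smaller when normalization is on, which would be enough to make two loss plots look quite different. Setting normalize_lr to False, as suggested above, would remove that difference.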