lucidrains / grokfast-pytorch

Explorations into the proposal from the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients"
MIT License

Seems to work for me #1

Open inspirit opened 3 months ago

inspirit commented 3 months ago

Hi Phil, I tested it in my private project 2 days ago, and it seems to speed up learning quite significantly. I'm not sure the final val/train losses are better; they're very similar to the original, but it got there much faster. Also, I did not compare across different tasks/architectures, but my project contains a few different nets: one includes a tiny transformer, another uses RNN cells, and the last is simple shallow convolutions.

lucidrains commented 3 months ago

hi Eugene and thanks for reporting this!

did you by chance compare it against Adam with half the learning rate? I don't think it's a fair comparison otherwise, as they effectively doubled the learning rate by summing in the slow grads

inspirit commented 3 months ago

nope, I did not change any parameters of the Adam optimizer. I just added the EMA grads step between loss.backward() and opt.step() as described in the paper, so it's indeed not a fair comparison :)
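
For reference, here is a minimal sketch of that step as the paper describes it. The `gradfilter_ema` name and the `alpha`/`lamb` defaults are meant to mirror the paper's reference code, so treat this as an illustration rather than this repo's exact API:

```python
import torch
from torch import nn

def gradfilter_ema(model, grads=None, alpha=0.98, lamb=2.0):
    # Keep an EMA of each parameter's gradient (the "slow" component)
    # and add it back, amplified by lamb, onto the raw gradient.
    if grads is None:
        grads = {n: p.grad.detach().clone()
                 for n, p in model.named_parameters()
                 if p.requires_grad and p.grad is not None}
    for n, p in model.named_parameters():
        if p.requires_grad and p.grad is not None:
            grads[n] = grads[n] * alpha + p.grad.detach() * (1 - alpha)
            p.grad = p.grad + grads[n] * lamb
    return grads

# usage: the filter slots in between backward() and step()
model = nn.Linear(10, 1)                            # stand-in for the real net
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
grads = None
for _ in range(100):
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy objective
    opt.zero_grad()
    loss.backward()
    grads = gradfilter_ema(model, grads)            # the extra EMA step
    opt.step()
```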

lucidrains commented 3 months ago

how big was the speed up you saw?

inspirit commented 3 months ago

25-30% faster

lucidrains commented 3 months ago

yeah, that isn't significant given double the LR, imo

lucidrains commented 3 months ago

I'll try it on some other tasks besides modular addition

lucidrains commented 3 months ago

actually, it is more like 3x the LR, given the lamb they use is 2
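
To spell out the arithmetic (using the `alpha = 0.98`, `lamb = 2.0` defaults from the sketch above): for a steady gradient the EMA converges to the gradient itself, so the filtered update tends to g + lamb * g = 3g. A tiny sanity check:

```python
alpha, lamb = 0.98, 2.0
g, ema = 1.0, 0.0
for _ in range(500):              # EMA of a constant gradient converges to g
    ema = alpha * ema + (1 - alpha) * g
print(g + lamb * ema)             # ~3.0 -> the effective gradient is about 3x
```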

inspirit commented 3 months ago

yup, that's right

inspirit commented 3 months ago

Looks like I was celebrating too early. Comparing now with Lion: after some trials searching for the correct LR, I can see that grokfast is better right from the start for about 3 epochs; afterwards Lion catches up. And I'm not correcting the LR for grokfast, so it effectively has 3x the LR.

lucidrains commented 3 months ago

@inspirit thanks! Was the comparison plain Lion vs. grokfast + Lion?

what kind of task are you training on?

inspirit commented 3 months ago

it was grokfast + Adam vs. Lion. I have a recurrent generator net with several conditioning encoders for audio and text. What I like about grokfast is that it's quite stable with almost any optimizer configuration, while Lion needs tuning to make it work.

lucidrains commented 3 months ago

@inspirit yeah, you should control for the same optimizer at least

but it is safe to say at this point that grokfast doesn't seem to produce any immediate improvements for practical tasks out of the box. I'm still interested in these tiny algorithmic tasks though; I'm wondering if one can curriculum grok a series of N tasks

inspirit commented 3 months ago

it seems to be very tricky to make it work with the Lion optimizer; I could not find an LR combination that makes it learn with grokfast