Open inspirit opened 3 months ago
hi Eugene and thanks for reporting this!
did you by chance compare it with Adam but with half the learning rate? I don't think it is a fair comparison as they effectively doubled their learning rate by summing the slow grads
nope, I did not change any parameters of the Adam opt. I just added the EMA grads step between loss.backward() and opt.step() as mentioned in the paper, so it's indeed not a fair comparison :)
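For anyone following along, the step being described is roughly the EMA gradient filter from the Grokfast paper: keep an exponential moving average of each parameter's gradient and add an amplified copy of it back onto the raw gradient before the optimizer step. A minimal sketch (the function name `gradfilter_ema` and the `alpha`/`lamb` defaults follow the paper's reference code, but treat this as an illustration, not the exact implementation):

```python
import torch

def gradfilter_ema(model, grads=None, alpha=0.98, lamb=2.0):
    # grads holds the per-parameter EMA of past gradients.
    # On the first call it is initialized from the current gradients.
    if grads is None:
        grads = {
            n: p.grad.detach().clone()
            for n, p in model.named_parameters()
            if p.grad is not None
        }
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        # update the slow (EMA) component ...
        grads[n] = grads[n] * alpha + p.grad.detach() * (1 - alpha)
        # ... and add the amplified slow component onto the raw gradient
        p.grad = p.grad + grads[n] * lamb
    return grads
```

Usage is exactly as described above: call it between `loss.backward()` and `opt.step()`, carrying `grads` across iterations:

```python
grads = None
for batch in loader:
    opt.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    grads = gradfilter_ema(model, grads)
    opt.step()
```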
how big was the speed up you saw?
25-30% faster
yeah that isn't significant for double the lr, imo
I'll try it on some tasks other than modular addition
actually, it is more like 3x the lr, given the lamb they use is 2
yup, that's right
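To make the 3x figure concrete: once the EMA has settled to roughly the current gradient g, the filtered gradient is g + lamb * g = (1 + lamb) * g, so lamb = 2 triples the effective step size. A quick sanity check (values are illustrative):

```python
g = 0.5           # raw gradient (illustrative value)
ema = g           # assume the EMA has converged to the raw gradient
lamb = 2.0        # amplification factor from the paper
filtered = g + lamb * ema
print(filtered / g)  # → 3.0
```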
Looks like I was celebrating too early. Comparing now with Lion: after some trials searching for the correct LR, I can see that grokfast is better right from the start for about 3 epochs, after which Lion catches up. Note that I'm not correcting the LR for grokfast, so it effectively has 3x the lr
@inspirit thanks! was the comparison with lion + grokfast with lion?
what kind of task are you training on?
it was grokfast+adam vs lion. I have a recurrent generator net with several conditioning encoders for audio and text. What I like about grokfast is that it's quite stable with almost any optimizer configuration, while Lion needs tuning to make it work
@inspirit yea, you should control for the same optimizer at least
but it is safe to say at this point that grokfast doesn't seem to produce any immediate improvements on practical tasks out of the box. I'm still interested in these tiny algorithmic tasks though; I'm wondering if one can curriculum grok a series of N tasks
it seems to be very tricky to make it work with the Lion opt; I did not find an LR combination that makes it learn with grokfast
Hi Phil, I tested it in my private project 2 days ago, and it seems to speed up learning quite significantly. I'm not sure the final val/train losses are better; they end up very similar to the original run, but it got there much faster. Also, I did not try different tasks/architectures to compare, but my project contains a few different nets: one includes a tiny transformer, another uses RNN cells, and the last is simple shallow convolutions