lucidrains / lion-pytorch

🦁 Lion, a new optimizer discovered by Google Brain via evolutionary program search that is purportedly better than Adam(w), in PyTorch

Adaptive learning-rate optimization #9

Closed · nestordemeure closed this 1 year ago

nestordemeure commented 1 year ago

Thank you for that implementation!

There was an older optimizer[0] that updated its learning rate per weight according to the following rule (basically, increasing the learning rate when updates keep going in the same direction and decreasing it when they do not):

```python
import torch

epsilon = 0.01
# grow the per-weight lr when the update keeps its sign, shrink it when it flips
same_sign = torch.sign(update[weight]) == torch.sign(previous_update[weight])
lr[weight] = torch.where(same_sign, lr[weight] * (1 + epsilon), lr[weight] / (1 + epsilon))
```

It requires keeping one learning rate per weight, as well as the sign of the previous update, so it adds to the optimizer's memory footprint.

However, at that price, you can find an epsilon and a starting learning rate that work well for a large range of problems, and you no longer have to think about learning-rate scheduling or about the optimal learning rate for a given problem.
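For concreteness, the extra state would look something like this (a minimal sketch; the names `lr` and `previous_update` just mirror the snippet above and are not from any particular library):

```python
import torch

# per-parameter state for the rule above (hypothetical layout)
p = torch.randn(128, 64)                # a parameter tensor
lr = torch.full_like(p, 1e-4)           # one learning rate per weight
previous_update = torch.zeros_like(p)   # sign of the last update (0 = none yet)
```

That is two extra tensors the size of the model, roughly the same order of overhead as Adam's two moment buffers.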

Given the regularity of Lion's update step, it might be worth playing with.
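To make that concrete, here is a minimal sketch of a single Lion step with the multiplicative per-weight rule folded in. It assumes Lion's published update (sign of the beta1-interpolated momentum, followed by a beta2 momentum refresh); the function name and signature are hypothetical, not part of lion-pytorch:

```python
import torch

@torch.no_grad()
def lion_step_adaptive(p, grad, exp_avg, prev_update, lr,
                       beta1=0.9, beta2=0.99, epsilon=0.01):
    # Lion's update direction: sign of the momentum / gradient interpolation
    update = torch.sign(exp_avg * beta1 + grad * (1 - beta1))

    # the adaptive rule: grow the per-weight lr where the direction repeats,
    # shrink it where the sign flips
    same_sign = update == prev_update
    lr.copy_(torch.where(same_sign, lr * (1 + epsilon), lr / (1 + epsilon)))

    # apply the per-weight-scaled update, then refresh Lion's momentum state
    p.sub_(lr * update)
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
    prev_update.copy_(update)
```

Since both the rule and Lion's sign update are elementwise, they compose without changing the update's structure; the cost is the two extra buffers noted above.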

[0]: I just searched for an exact reference but could not find it anymore; it was from before Adam came to dominate the ML world.

lucidrains commented 1 year ago

@nestordemeure hey Nestor, thanks for the suggestion

i feel like adding anything more is just a slippery slope toward the demise of so many optimizers of the past. i think we should just collect more feedback on people's experience with Lion. if it is not uniformly good, it is probably best to stick with Adam and look forward to learned approaches. it would also be kind of an indictment of all these neural architecture search papers if it surfaces that Lion does not work as well as advertised

lucidrains commented 1 year ago

@nestordemeure that said, are you seeing a signal when training with Lion?