mila-iqia / blocks

A Theano framework for building and training neural networks

Momentum step rule implementation is counter-intuitive #240

Closed vdumoulin closed 9 years ago

vdumoulin commented 9 years ago

Following a conversation with @bartvm, I now understand why Momentum is implemented the way it is: it is meant to be chained with SteepestDescent, so that Momentum's `self.momentum * velocity + gradient` translates into something like `momentum * velocity - learning_rate * gradient` when used in conjunction with SteepestDescent.

This has the unfortunate effect of obfuscating the implementation (which could be addressed with proper documentation), and it also requires the user to be very careful about how things are chained:
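For illustration, a minimal sketch of the intended chaining; the constructor arguments here are assumptions about the API at the time rather than a verbatim reproduction of it:

```python
from blocks.algorithms import CompositeRule, Momentum, SteepestDescent

# Intended order: SteepestDescent scales the gradient by the learning
# rate first, and Momentum then accumulates the scaled steps into its
# velocity.  Putting Momentum first would make the velocity accumulate
# raw, unscaled gradients instead.
step_rule = CompositeRule([SteepestDescent(learning_rate=0.01),
                           Momentum(momentum=0.9)])
```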

On the other hand, there may be some advantages to having Momentum be composable with other step rules.

My proposal would be to rename Momentum to make it clear that it is not intended to be used alone, and to wrap SteepestDescent + Momentum in a Momentum step rule that chains them together in the correct order.
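A sketch of what that could look like; the layout mirrors the proposal above, but the names and signatures are illustrative assumptions, not the code that was eventually merged:

```python
from blocks.algorithms import CompositeRule, SteepestDescent
# The existing chainable rule, imported under the proposed clearer name.
from blocks.algorithms import Momentum as MomentumChainable


class Momentum(CompositeRule):
    """Pre-baked combination of SteepestDescent and the chainable momentum
    rule, wired in the correct order so the user cannot get it wrong."""
    def __init__(self, learning_rate=1.0, momentum=0.0):
        super(Momentum, self).__init__(
            [SteepestDescent(learning_rate=learning_rate),
             MomentumChainable(momentum=momentum)])
```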

rizar commented 9 years ago

I see your point, but I am not sure I agree with it. In fact, the full name of the algorithm is "gradient descent with momentum", which means that having momentum as an additional component makes sense. In PyLearn2 it was a stand-alone learning rule, and maybe that causes some confusion, but so far I am inclined to think that the Blocks way is better. I can put a note in the Momentum documentation explaining how to add a learning rate.

I agree that it is weird that Momentum cannot be used alone. I think the reason is that the multiplication by minus one should be moved out into the GradientDescent class, and #243 does exactly that. After #243, using Momentum alone would imply a learning rate equal to one.
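Concretely, assuming the post-#243 convention that step rules return a positive step which GradientDescent then subtracts, a bare Momentum chain behaves like this (a toy numeric sketch, not Blocks code):

```python
momentum, velocity, gradient, parameter = 0.9, 0.0, 0.5, 1.0

velocity = momentum * velocity + gradient   # Momentum's step
parameter = parameter - velocity            # subtraction done by GradientDescent
print(parameter)  # 0.5: a full, unscaled gradient step, i.e. learning rate 1
```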

ADADELTA is sometimes used with gradient rescaling; @janchorowski did it when working with TIMIT. If using Adadelta and SteepestDescent in a chain looks weird, we can add a clone of SteepestDescent called Rescale that reads better.
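Such a Rescale could be nothing more than SteepestDescent under a more neutral name; this is purely hypothetical, the class does not exist in Blocks:

```python
from blocks.algorithms import SteepestDescent


class Rescale(SteepestDescent):
    """Scale the step by a constant factor.  Identical to SteepestDescent;
    the name simply reads better when chained with Adadelta."""
```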

vdumoulin commented 9 years ago

I wouldn't revert to having monolithic step rules either, but I think that having a set of pre-baked combinations also makes sense.

Here's an example. In order to get the correct behaviour for SteepestDescent + Momentum, SteepestDescent needs to come first and Momentum after. While implementing RMSProp (see #246) in the same style as Momentum, I realized that the correct chaining order in that case is RMSProp first, then SteepestDescent.

I'm probably not the only one who is bound to make this type of mistake, which is why I implemented two classes for RMSProp: RMSPropChainable, which is analogous to Momentum, and RMSProp, a subclass of CompositeRule that chains RMSPropChainable and SteepestDescent in the correct order.
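Roughly, the arrangement looks like this; the internals of RMSPropChainable are omitted and the constructor arguments are assumptions about #246 rather than the code that was merged:

```python
from blocks.algorithms import CompositeRule, SteepestDescent, StepRule


class RMSPropChainable(StepRule):
    """Divide each incoming step by a running root-mean-square of past
    steps (the actual running-average updates are omitted here)."""
    def __init__(self, decay_rate=0.9):
        self.decay_rate = decay_rate


class RMSProp(CompositeRule):
    """Pre-baked combination in the correct order: rescale by the running
    RMS first, then by the learning rate."""
    def __init__(self, learning_rate=1.0, decay_rate=0.9):
        super(RMSProp, self).__init__(
            [RMSPropChainable(decay_rate=decay_rate),
             SteepestDescent(learning_rate=learning_rate)])
```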

rizar commented 9 years ago

> In order to get the correct behaviour for SteepestDescent + Momentum, SteepestDescent needs to come first and Momentum comes after

Maybe I am missing something simple: why can GradientDescent not go after Momentum?

vdumoulin commented 9 years ago

SteepestDescent and Momentum are not commutative with respect to their ordering in CompositeRule:
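A rough reconstruction of the comparison; the pre-#243 sign convention is assumed, and the constructor arguments are illustrative:

```python
from blocks.algorithms import CompositeRule, Momentum, SteepestDescent

# SteepestDescent first: the momentum term accumulates gradients that
# are already scaled by the learning rate.
#   velocity <- momentum * velocity - learning_rate * gradient
#   step     <- velocity
first = CompositeRule([SteepestDescent(learning_rate=0.01),
                       Momentum(momentum=0.9)])

# Momentum first: the velocity accumulates raw gradients, and the
# learning-rate scaling is applied afterwards.
#   velocity <- momentum * velocity + gradient
#   step     <- -learning_rate * velocity
second = CompositeRule([Momentum(momentum=0.9),
                        SteepestDescent(learning_rate=0.01)])
```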

rizar commented 9 years ago

Yep, but the velocity term in the second expression is different from the momentum term in the first one, because we accumulate the gradient, not the gradient scaled by the learning rate. It seems to me that those two step rules are commutative; more precisely, they will be after #243 (please comment on that one!).
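One way to see this, assuming a constant learning rate and the post-#243 convention that GradientDescent subtracts the step (a toy check, not Blocks code):

```python
import numpy as np

momentum, learning_rate = 0.9, 0.1
gradients = [0.5, -0.2, 0.3]

# SteepestDescent first: velocity accumulates learning_rate-scaled gradients.
v1, steps1 = 0.0, []
for g in gradients:
    v1 = momentum * v1 + learning_rate * g
    steps1.append(v1)

# Momentum first: velocity accumulates raw gradients, scaled afterwards.
v2, steps2 = 0.0, []
for g in gradients:
    v2 = momentum * v2 + g
    steps2.append(learning_rate * v2)

print(np.allclose(steps1, steps2))  # True: same steps, different internal velocity
```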

But if so many people wish to have an all-batteries-included Momentum, then I do not mind. I would just not call the chainable one MomentumChainable, but rather JustMomentum or MomentumPure or something like that, because technically all step rules are chainable.

rizar commented 9 years ago

Fixed by #266