I see your point, but I am not sure I agree with it. In fact, the full name of the algorithm is "gradient descent with momentum", which means that having momentum as an additional component makes sense. In PyLearn2 it was a stand-alone learning rule, and maybe that causes confusion, but so far I am inclined to think that the Blocks way is better. I can put a note in the Momentum documentation explaining how to add a learning rate.
I agree that it is weird that Momentum cannot be used alone. I think the reason is that the multiplication by minus one should be moved out to the GradientDescent class, and #243 does that. After #243, using Momentum alone would imply a learning rate equal to one.
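To make that concrete, here is a tiny sketch (plain Python, not Blocks code) of what using a bare momentum rule by itself amounts to once the sign flip lives in the descent update, as #243 proposes:

```python
# Toy sketch (not Blocks code): a bare momentum rule just accumulates
# unscaled gradients, so with the minus sign applied by the update itself,
# using it alone is gradient descent with momentum and learning rate 1.0.
momentum = 0.9
velocity = 0.0
param = 1.0

def momentum_step(gradient):
    global velocity
    velocity = momentum * velocity + gradient  # no learning rate anywhere
    return velocity

for gradient in [0.5, 0.3, 0.2]:
    param -= momentum_step(gradient)  # the update supplies the minus sign
```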
ADADELTA is sometimes used with gradient rescaling; @janchorowski did it when working with TIMIT. If using Adadelta and SteepestDescent in a chain looks weird, we can add a clone of SteepestDescent called Rescale that reads better.
I wouldn't revert to having monolithic step rules either, but I think that having a set of pre-baked combinations also makes sense.
Here's an example. In order to get the correct behaviour for SteepestDescent + Momentum, SteepestDescent needs to come first and Momentum after. While implementing RMSProp (see #246) the same way Momentum is implemented, I realized that the correct chaining order in that case is to use RMSProp first and then SteepestDescent.
I'm probably not the only one who's bound to make these kinds of mistakes, which is why I implemented two classes for RMSProp: RMSPropChainable is analogous to Momentum, and RMSProp is a subclass of CompositeRule which chains RMSPropChainable and SteepestDescent in the correct order.
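For illustration, here is a toy version of that two-class pattern in plain Python. This is not the actual Blocks StepRule interface (the real rules work on Theano variables and per-parameter steps), and the compute_step signature and constructor arguments are simplifications, but the chaining and ordering mirror the description above:

```python
import math

class RMSPropChainable:
    """Chainable part: rescale each step by a running RMS of past steps."""
    def __init__(self, decay_rate=0.9, epsilon=1e-7):
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.mean_square = 0.0

    def compute_step(self, previous_step):
        self.mean_square = (self.decay_rate * self.mean_square
                            + (1 - self.decay_rate) * previous_step ** 2)
        return previous_step / (math.sqrt(self.mean_square) + self.epsilon)

class SteepestDescent:
    """Scale the step by the learning rate (sign flip included, pre-#243 style)."""
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def compute_step(self, previous_step):
        return -self.learning_rate * previous_step

class CompositeRule:
    """Apply components left to right; each output feeds the next rule,
    which is why the ordering matters."""
    def __init__(self, components):
        self.components = components

    def compute_step(self, gradient):
        step = gradient
        for component in self.components:
            step = component.compute_step(step)
        return step

class RMSProp(CompositeRule):
    """Pre-baked combination: rescale first, then apply the learning rate."""
    def __init__(self, learning_rate=0.01, decay_rate=0.9):
        super().__init__([RMSPropChainable(decay_rate),
                          SteepestDescent(learning_rate)])

rule = RMSProp(learning_rate=0.01, decay_rate=0.9)
step = rule.compute_step(0.5)  # rescaled by the running RMS, then by -learning_rate
```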
"In order to get the correct behaviour for SteepestDescent + Momentum, SteepestDescent needs to come first and Momentum comes after"
Maybe I am not understanding something simple: why can GradientDescent not go after Momentum?
SteepestDescent and Momentum are not commutative with respect to their ordering in CompositeRule:

- With SteepestDescent first, Momentum accumulates -learning_rate * gradient, so the final step looks like momentum * velocity - learning_rate * gradient.
- With Momentum first, it accumulates the raw gradient and outputs momentum * velocity + gradient, so the final step looks like -learning_rate * (momentum * velocity + gradient).
Yep, but the velocity term in the second expression is different from the one in the first, because we accumulate the gradient, not the gradient scaled by the learning rate. It seems to me that those two step rules are commutative; more exactly, they will be after #243 (please comment on that one!).
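As a quick sanity check of that point, here is a toy numeric comparison (plain Python, not Blocks code) of the two orderings. With a constant learning rate they produce identical steps, because the velocity in one is just the velocity in the other scaled by -learning_rate:

```python
# Toy check: the two chaining orders give the same numbers when the
# learning rate is constant; they would diverge if it changed between steps.
learning_rate, momentum = 0.1, 0.9

velocity_a = 0.0  # order A: SteepestDescent first, then Momentum
velocity_b = 0.0  # order B: Momentum first, then SteepestDescent

for gradient in [1.0, 2.0, -0.5]:
    # Order A: Momentum accumulates the already-scaled gradient.
    velocity_a = momentum * velocity_a + (-learning_rate * gradient)
    step_a = velocity_a
    # Order B: Momentum accumulates the raw gradient; scaling happens last.
    velocity_b = momentum * velocity_b + gradient
    step_b = -learning_rate * velocity_b
    assert abs(step_a - step_b) < 1e-12
```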
But if so many people wish to have an all-batteries-included Momentum, then I do not mind. I would just not call it MomentumChainable, but rather JustMomentum or MomentumPure or something like that, because technically all step rules are chainable.
Fixed by #266
Following a conversation with @bartvm I now understand why Momentum is implemented the way it is (it's meant to be chained with SteepestDescent, so that Momentum's self.momentum * velocity + gradient translates to something like momentum * velocity - learning_rate * gradient when used in conjunction with SteepestDescent). It has the unfortunate effect of obfuscating the implementation (which can be solved by proper documentation), and it also requires the user to be very careful about how things are chained:
- Momentum cannot be used alone, or it would lead to counter-intuitive results.
- Chaining it with SteepestDescent in the wrong order doesn't make much sense.

On the other hand, there may be some advantages to having Momentum be composable with other step rules. My proposition would be to rename Momentum to make it clear that it's not intended to be used alone, and to wrap SteepestDescent + Momentum in a Momentum step rule which chains them together.
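For concreteness, a minimal sketch of that proposal in plain Python (again, not the Blocks API; BasicMomentum is just a placeholder name for the renamed chainable rule):

```python
class BasicMomentum:
    """The renamed chainable rule: accumulates whatever step it receives."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.velocity = 0.0

    def compute_step(self, previous_step):
        self.velocity = self.momentum * self.velocity + previous_step
        return self.velocity

class SteepestDescent:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def compute_step(self, previous_step):
        return -self.learning_rate * previous_step

class CompositeRule:
    def __init__(self, components):
        self.components = components

    def compute_step(self, gradient):
        step = gradient
        for component in self.components:
            step = component.compute_step(step)
        return step

class Momentum(CompositeRule):
    """All-batteries-included rule: SteepestDescent first, then the chainable
    momentum, i.e. the ordering established earlier in the thread."""
    def __init__(self, learning_rate=0.01, momentum=0.9):
        super().__init__([SteepestDescent(learning_rate),
                          BasicMomentum(momentum)])
```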