Open madisonmay opened 10 years ago
Would anyone else familiar with pylearn2's AdaDelta
implementation care to comment on this one? I'd like to see this issue addressed if at all possible, but perhaps I'm overlooking something obvious.
The original author is @gdesjardins, maybe he can comment.
I don't know much about AdaDelta, I've just had a quick look at the article, and it looks like they directly compare their `epsilon` parameter to the learning rate of other methods (fig. 1), so it may be reasonable to use `learning_rate` for that. We should probably add some documentation, though.
If we want to eliminate the `learning_rate` from `AdaDelta`, I think simply ignoring it would be confusing, and we should instead remove it from `SGD`, or at least ignore it (and make it an error to have it) if a `LearningRule` is specified, and have the `learning_rate` specified in the `LearningRule` instead, if it makes sense.
I don't think it's worth the trouble at that point, but if other learning rules come up, I might change my mind.
From the arxiv paper on AdaDelta:
> The benefits of this approach are as follows:
> • no manual setting of a learning rate.
> • insensitive to hyperparameters.
> ...
The idea behind AdaDelta is that the "method requires no manual tuning of a learning rate." AdaDelta sets the learning rate of each weight individually, whereas pylearn2's `learning_rate` parameter acts to set a global learning rate. Table 1 in the paper shows that ADAGRAD and other methods are sensitive to hyperparameters, while Table 2 shows that the value of ADADELTA's epsilon parameter does not have a large effect on model performance.
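To make the per-weight behavior concrete, here is a minimal numpy sketch of the update rule from the paper (my own paraphrase of Algorithm 1, not pylearn2's Theano code): each parameter's effective step size is built from running averages of its own squared gradients and squared updates, and no global learning rate appears anywhere.

```python
import numpy as np

def adadelta_step(grad, accum_grad, accum_delta, rho=0.95, eps=1e-6):
    # Running average of squared gradients.
    accum_grad = rho * accum_grad + (1 - rho) * grad ** 2
    # Per-parameter update: RMS of past updates over RMS of gradients.
    delta = -np.sqrt(accum_delta + eps) / np.sqrt(accum_grad + eps) * grad
    # Running average of squared updates.
    accum_delta = rho * accum_delta + (1 - rho) * delta ** 2
    return delta, accum_grad, accum_delta

# Minimize f(x) = x^2 elementwise; note there is no learning-rate knob.
x = np.array([1.0, -3.0])
ag = np.zeros_like(x)
ad = np.zeros_like(x)
for _ in range(500):
    delta, ag, ad = adadelta_step(2 * x, ag, ad)
    x = x + delta
```

Each coordinate gets its own adaptive rate, which is why a separate global scale factor is largely compensated away over time.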
My concern is not that a global `learning_rate` is passed in (it should be fine to treat the `learning_rate` as a float that scales each gradient update, although the AdaDelta learning rule is "insensitive to hyperparameters" and will quickly compensate for the scaled updates by reducing the coefficients on the individual per-weight learning rates), but that it is used directly as the epsilon parameter without any mention in the documentation. Whereas for standard SGD or SGD with momentum a high learning rate will cause wild oscillations in the validation error over time, a large epsilon for AdaDelta will simply cause the weight updates to approach zero more quickly, so it would be quite easy to become confused when manual modification of the `learning_rate` parameter does not have the expected effect on a model.
> If we want to eliminate the `learning_rate` from `AdaDelta`, I think simply ignoring it would be confusing, and we should instead remove it from `SGD`, or at least ignore it (and make it an error to have it) if a `LearningRule` is specified, and have the `learning_rate` specified in the `LearningRule` instead, if it makes sense.
I would support moving `learning_rate` to being a parameter of `LearningRule`, but that would obviously break a lot of existing code, so it's perhaps not the best route forward. A simpler approach might be to add an optional `epsilon` parameter to AdaDelta that defaults to something reasonable like 1e-6, scale each weight's calculated learning rate by the global `learning_rate` specified by the argument to SGD, and update the docs to be much more explicit about this behavior. I would also be fine with raising an error / warning when a `learning_rate` is set and the AdaDelta learning rule is applied.
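A hypothetical sketch of what that simpler approach might look like (plain numpy instead of Theano, and all names here are mine rather than pylearn2's): `epsilon` becomes its own hyperparameter with a sensible default, and the global `learning_rate` handed down from SGD merely rescales the already-adaptive per-parameter update.

```python
import numpy as np

class AdaDelta:
    # Hypothetical interface sketch (not pylearn2's actual class): epsilon
    # is an explicit hyperparameter, and learning_rate only scales the
    # final update, consistent with the other learning rules.
    def __init__(self, decay=0.95, epsilon=1e-6):
        self.decay = decay
        self.epsilon = epsilon
        self.accum_grad = None
        self.accum_delta = None

    def get_update(self, learning_rate, param, grad):
        if self.accum_grad is None:
            self.accum_grad = np.zeros_like(param)
            self.accum_delta = np.zeros_like(param)
        rho, eps = self.decay, self.epsilon
        self.accum_grad = rho * self.accum_grad + (1 - rho) * grad ** 2
        delta = (-np.sqrt(self.accum_delta + eps)
                 / np.sqrt(self.accum_grad + eps) * grad)
        self.accum_delta = rho * self.accum_delta + (1 - rho) * delta ** 2
        # With learning_rate = 1.0 this is exactly the rule from the paper.
        return param + learning_rate * delta

rule = AdaDelta()
p = np.array([2.0])
for _ in range(300):
    p = rule.get_update(1.0, p, 2 * p)  # minimize f(p) = p^2
```

With a learning rate of 1 (the paper's implicit choice), the behavior is unchanged, but the parameter's meaning now matches the other rules.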
I'd love to hear @gdesjardins's opinion, though. My apologies for the wall of text.
I was also confused by the fact that learning rate becomes `epsilon`.
Same here, I spent some time trying to increase learning rate wildly just to observe any effect ;)
The learning rate parameter is used in a very unintuitive way in the implementation of the `AdaDelta` learning rule -- it's scaled by the `lr_scalers` and then fed in as the epsilon parameter described in AdaDelta: An Adaptive Learning Rate Method. This is not at all transparent and could lead to confusion when using `lr_scalers` or manually modifying `learning_rate` does not lead to the expected result.

Is it worth leaving `learning_rate` as a coefficient of the update term and instead asking for a separate `epsilon` parameter? I know the goal of `AdaDelta` is to eliminate the need for a `learning_rate` parameter / other sensitive hyperparameters (meaning that a learning rate of 1 is a good choice), but I think it's probably cleaner to keep the usage of the SGD `learning_rate` parameter consistent across learning rules and to be explicit about the use of a different type of hyperparameter in `AdaDelta`. Thoughts?

Relevant code here.
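A toy numpy paraphrase of the behavior described above (this is my reading of the description, not the actual Theano code) shows why sweeping the learning rate has so little visible effect: with the `learning_rate` sitting in the epsilon slot, the very first update is exactly `-grad` regardless of its value, since both accumulators start at zero.

```python
import numpy as np

def adadelta_update_as_implemented(grad, accum_grad, accum_delta, scaled_lr):
    # Paraphrase of the reported behavior: the SGD learning_rate, after
    # lr_scalers, lands inside the RMS terms as epsilon instead of
    # multiplying the update.
    return (-np.sqrt(accum_delta + scaled_lr)
            / np.sqrt(accum_grad + scaled_lr) * grad)

g = np.array([0.5])
zeros = np.zeros(1)
small = adadelta_update_as_implemented(g, zeros, zeros, 0.01)
huge = adadelta_update_as_implemented(g, zeros, zeros, 100.0)
# A 10,000x change in learning_rate produces the identical first update.
```

Later steps do differ (a larger epsilon keeps the ratio of the two RMS terms closer to 1), but not in the direction or magnitude a user sweeping a learning rate would expect.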