Open madisonmay opened 10 years ago
Would anyone else familiar with pylearn2's AdaDelta
implementation care to comment on this one? I'd like to see this issue addressed if at all possible, but perhaps I'm overlooking something obvious.
The original author is @gdesjardins, maybe he can comment.
I don't know much about AdaDelta, I've just had a quick look at the article, and it looks like they directly compare their `epsilon` parameter to the learning rate of other methods (fig. 1), so it may be reasonable to use `learning_rate` for that. We should probably add some documentation, though.
If we want to eliminate the `learning_rate` from `AdaDelta`, I think simply ignoring it would be confusing, and we should instead remove it from `SGD`, or at least ignore it (and make it an error to have it) if a `LearningRule` is specified, and have the `learning_rate` specified in the `LearningRule` instead, if it makes sense.
I don't think it's worth the trouble at that point, but if other learning rules come up, I might change my mind.
From the arxiv paper on AdaDelta:
> The benefits of this approach are as follows:
> • no manual setting of a learning rate.
> • insensitive to hyperparameters.
> ...
The idea behind AdaDelta is that the "method requires no manual tuning of a learning rate." AdaDelta sets the learning rate of each weight individually, whereas pylearn2's `learning_rate` parameter acts to set a global learning rate. Table 1 in the paper shows that ADAGRAD and other methods are sensitive to hyperparameters, while Table 2 shows that the value of ADADELTA's epsilon parameter does not have a large effect on model performance.
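To make the per-weight behavior concrete, here is a minimal numpy sketch of the update rule from the paper (my own paraphrase of Algorithm 1, not pylearn2's Theano code): each parameter's effective step size is built from running averages of its own squared gradients and squared updates, and no global learning rate appears anywhere.

```python
import numpy as np

def adadelta_step(grad, accum_grad, accum_delta, rho=0.95, eps=1e-6):
    # Running average of squared gradients.
    accum_grad = rho * accum_grad + (1 - rho) * grad ** 2
    # Per-parameter update: RMS of past updates over RMS of gradients.
    delta = -np.sqrt(accum_delta + eps) / np.sqrt(accum_grad + eps) * grad
    # Running average of squared updates.
    accum_delta = rho * accum_delta + (1 - rho) * delta ** 2
    return delta, accum_grad, accum_delta

# Minimize f(x) = x^2 elementwise; note there is no learning-rate knob.
x = np.array([1.0, -3.0])
ag = np.zeros_like(x)
ad = np.zeros_like(x)
for _ in range(500):
    delta, ag, ad = adadelta_step(2 * x, ag, ad)
    x = x + delta
```

Each coordinate gets its own adaptive rate, which is why a separate global scale factor is largely compensated away over time.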
My concern is not that a global `learning_rate` is passed in (it should be fine to treat the `learning_rate` as a float that scales each gradient update, although the AdaDelta learning rule is "insensitive to hyperparameters" and will quickly compensate for the scaled updates by reducing the coefficients on the individual per-weight learning rates), but that it is used directly as the epsilon parameter without any mention in the documentation. Whereas for standard SGD or SGD with momentum a high learning rate will cause wild oscillations in the validation error over time, a large epsilon for AdaDelta will simply cause the weight updates to approach zero more quickly, so it would be quite easy to become confused when manual modification of the `learning_rate` parameter does not have the expected effect on a model.
> If we want to eliminate the `learning_rate` from `AdaDelta`, I think simply ignoring it would be confusing, and we should instead remove it from `SGD`, or at least ignore it (and make it an error to have it) if a `LearningRule` is specified, and have the `learning_rate` specified in the `LearningRule` instead, if it makes sense.
I would support moving `learning_rate` to being a parameter of `LearningRule`, but that would obviously break a lot of existing code, so it's perhaps not the best route forward. A simpler approach might be to add an optional `epsilon` parameter to AdaDelta that defaults to something reasonable like 1e-6, scale each weight's calculated learning rate by the global `learning_rate` specified by the argument to SGD, and update the docs to be much more explicit about this behavior. I would also be fine with raising an error / warning when a `learning_rate` is set and the AdaDelta learning rule is applied.
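A hypothetical sketch of what that simpler approach might look like (plain numpy instead of Theano, and all names here are mine rather than pylearn2's): `epsilon` becomes its own hyperparameter with a sensible default, and the global `learning_rate` handed down from SGD merely rescales the already-adaptive per-parameter update.

```python
import numpy as np

class AdaDelta:
    # Hypothetical interface sketch (not pylearn2's actual class): epsilon
    # is an explicit hyperparameter, and learning_rate only scales the
    # final update, consistent with the other learning rules.
    def __init__(self, decay=0.95, epsilon=1e-6):
        self.decay = decay
        self.epsilon = epsilon
        self.accum_grad = None
        self.accum_delta = None

    def get_update(self, learning_rate, param, grad):
        if self.accum_grad is None:
            self.accum_grad = np.zeros_like(param)
            self.accum_delta = np.zeros_like(param)
        rho, eps = self.decay, self.epsilon
        self.accum_grad = rho * self.accum_grad + (1 - rho) * grad ** 2
        delta = (-np.sqrt(self.accum_delta + eps)
                 / np.sqrt(self.accum_grad + eps) * grad)
        self.accum_delta = rho * self.accum_delta + (1 - rho) * delta ** 2
        # With learning_rate = 1.0 this is exactly the rule from the paper.
        return param + learning_rate * delta

rule = AdaDelta()
p = np.array([2.0])
for _ in range(300):
    p = rule.get_update(1.0, p, 2 * p)  # minimize f(p) = p^2
```

With a learning rate of 1 (the paper's implicit choice), the behavior is unchanged, but the parameter's meaning now matches the other rules.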
I'd love to hear @gdesjardins's opinion, though. My apologies for the wall of text.
I was also confused by the fact that learning rate becomes `epsilon`.
Same here, I spent some time trying to increase learning rate wildly just to observe any effect ;)
The learning rate parameter is used in a very unintuitive way in the implementation of the `AdaDelta` learning rule -- it's scaled by the `lr_scalers` and then fed in as the epsilon parameter described in AdaDelta: An Adaptive Learning Rate Method. This is not at all transparent and could lead to confusion when using `lr_scalers` or manually modifying `learning_rate` does not lead to the expected result.

Is it worth leaving `learning_rate` as a coefficient of the update term and instead asking for a separate `epsilon` parameter? I know the goal of `AdaDelta` is to eliminate the need for a `learning_rate` parameter / other sensitive hyperparameters (meaning that a learning rate of 1 is a good choice), but I think it's probably cleaner to keep the usage of the SGD `learning_rate` parameter consistent across learning rules and to be explicit about the use of a different type of hyperparameter in `AdaDelta`. Thoughts?

Relevant code here.
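A toy numpy paraphrase of the behavior described above (this is my reading of the description, not the actual Theano code) shows why sweeping the learning rate has so little visible effect: with the `learning_rate` sitting in the epsilon slot, the very first update is exactly `-grad` regardless of its value, since both accumulators start at zero.

```python
import numpy as np

def adadelta_update_as_implemented(grad, accum_grad, accum_delta, scaled_lr):
    # Paraphrase of the reported behavior: the SGD learning_rate, after
    # lr_scalers, lands inside the RMS terms as epsilon instead of
    # multiplying the update.
    return (-np.sqrt(accum_delta + scaled_lr)
            / np.sqrt(accum_grad + scaled_lr) * grad)

g = np.array([0.5])
zeros = np.zeros(1)
small = adadelta_update_as_implemented(g, zeros, zeros, 0.01)
huge = adadelta_update_as_implemented(g, zeros, zeros, 100.0)
# A 10,000x change in learning_rate produces the identical first update.
```

Later steps do differ (a larger epsilon keeps the ratio of the two RMS terms closer to 1), but not in the direction or magnitude a user sweeping a learning rate would expect.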