haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Gradient backpropagation in ReLU #528

Closed. albertmolinermrf closed this issue 4 years ago.

albertmolinermrf commented 4 years ago

Expected behaviour

In an MLP, the gradient from the upper layers must be backpropagated to the lower layers so that weight updates are computed correctly.

Actual behaviour

When using a rectifier (ReLU) layer in v2.2.2, the gradient coming from the upper layer is always overwritten with either 0 or 1. As a result, rectifier layers cannot minimize the cost function as they should.

I think the problem lies in this line: https://github.com/haifengl/smile/blob/acf15aa1e5366215aa576a47ec4ce0826db8c335/core/src/main/java/smile/base/mlp/ActivationFunction.java#L95 A `*` is missing before the `=`; with it, the gradient from the upper layer would also be taken into account.
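For reference, the difference in that line (the fixed form matches the fix posted later in this thread):

```java
// v2.2.2: overwrites the backpropagated gradient with the ReLU derivative alone.
g[i] = y[i] > 0 ? 1 : 0;

// Intended (chain rule): scales the backpropagated gradient by the ReLU derivative.
g[i] *= y[i] > 0 ? 1 : 0;
```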

I can provide examples of the consequences of the wrong behaviour if needed.

haifengl commented 4 years ago

The gradient of ReLU(x) is 1 if x > 0 and 0 otherwise. Note that y in g(g, y) is the output of weight * input; that is, it is the x in ReLU(x).

albertmolinermrf commented 4 years ago

Indeed, the gradient of max(0, x) is either 0 or 1.

However, by the chain rule, the gradient of the cost function with respect to a weight in a ReLU layer is the gradient of the ReLU output with respect to that weight (which carries the 0-or-1 factor) multiplied by the gradient of the cost function with respect to the ReLU output (the incoming g). Currently, the latter term is discarded.
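In symbols (a sketch; here C is the cost, x is the pre-activation that the code calls y, and g is the incoming gradient from the upper layer):

$$\frac{\partial C}{\partial x} = \frac{\partial C}{\partial \mathrm{ReLU}(x)} \cdot \frac{\partial \mathrm{ReLU}(x)}{\partial x} = g \cdot \mathbf{1}[x > 0]$$

The current line keeps only the $\mathbf{1}[x > 0]$ factor and drops $g$.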

Consider how the gradient is applied in the simpler case of the bias: https://github.com/haifengl/smile/blob/482f63d0c4fe768a3835c54697a9401f990b707e/core/src/main/java/smile/base/mlp/Layer.java#L144 If, in a ReLU layer, the gradient contains only 0 or 1 values, then updateBias is always non-negative, so the bias can only stay the same or increase (always by the same amount), but never decrease.

The bias should be able to increase, decrease, or stay the same depending on the characteristics of the data and the model. Multiplying by the gradient of the cost function allows exactly that, in the way the chain rule requires.
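To make the consequence concrete, here is a minimal self-contained sketch (not Smile's actual API; the values are made up) comparing the accumulated bias update for one unit over a small minibatch, with and without the chain-rule factor:

```java
// Minimal sketch of the bias-update argument above (hypothetical values).
public class ReluGradientDemo {
    public static void main(String[] args) {
        double[] g = {-0.7, 0.4, -1.2}; // dC/dReLU(y) for one unit across three minibatch samples
        double[] y = {0.5, -0.3, 2.0};  // the unit's pre-activation output on those samples

        double buggyUpdate = 0.0, fixedUpdate = 0.0;
        for (int i = 0; i < y.length; i++) {
            double buggy = y[i] > 0 ? 1 : 0;          // g[i] = y[i] > 0 ? 1 : 0;  (v2.2.2)
            double fixed = g[i] * (y[i] > 0 ? 1 : 0); // g[i] *= y[i] > 0 ? 1 : 0; (fix)
            buggyUpdate += buggy;
            fixedUpdate += fixed;
        }
        // Buggy update: 2.0 (non-negative by construction, so the bias never decreases).
        // Fixed update: -0.7 + 0.0 + (-1.2) = -1.9 (the bias can decrease when the cost demands it).
        System.out.println("buggy bias update: " + buggyUpdate);
        System.out.println("fixed bias update: " + fixedUpdate);
    }
}
```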

haifengl commented 4 years ago

Thanks. I fixed it:

g[i] *= y[i] > 0 ? 1 : 0;
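(An equivalent form that avoids the multiplication would be `if (y[i] <= 0) g[i] = 0;`; both zero the gradient wherever the unit was inactive and leave it untouched otherwise.)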

The MSE on the regression tests is generally smaller now.

haifengl commented 4 years ago

@albertmolinermrf v2.3 is out with the fix.