Closed: albertmolinermrf closed this issue 4 years ago
The gradient of `ReLU(x)` is 1 if x > 0 and 0 otherwise. Note that `y` in `g(g, y)` is the output of `weight * input`; that is, it is the `x` in `ReLU(x)`.
Indeed, the gradient of `max(0, x)` is either 0 or 1.
However, by the chain rule, the gradient of the cost function with respect to a weight feeding a ReLU is the gradient of the ReLU's output with respect to that weight (either 0 or 1) multiplied by the gradient of the cost function with respect to the ReLU's output (the incoming `g`). Currently, that latter term is discarded.
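Concretely, the chain rule applied to a ReLU layer can be sketched as follows (illustrative Java, not Smile's actual `ActivationFunction` code; the method name is made up):

```java
public class ReluBackprop {
    /**
     * In-place backprop through a ReLU: g holds the incoming gradient dC/dy
     * from the upper layer, y holds the ReLU outputs. The local derivative
     * is 1 where the output was positive and 0 elsewhere.
     */
    static void reluGradient(double[] g, double[] y) {
        for (int i = 0; i < g.length; i++) {
            g[i] *= y[i] > 0 ? 1 : 0;   // chain rule: dC/dx = dC/dy * dy/dx
        }
    }

    public static void main(String[] args) {
        double[] g = {0.5, 0.3, 0.2};
        double[] y = {1.0, 0.0, 2.0};
        reluGradient(g, y);
        System.out.println(java.util.Arrays.toString(g)); // [0.5, 0.0, 0.2]
    }
}
```

The key point is `*=` rather than `=`: the incoming gradient is scaled by the local derivative, not replaced by it.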
Consider how the gradient is applied to the particular case of bias, which is simpler to analyze:
https://github.com/haifengl/smile/blob/482f63d0c4fe768a3835c54697a9401f990b707e/core/src/main/java/smile/base/mlp/Layer.java#L144
If, in a ReLU, `gradient` holds only 0 or 1 values, then `updateBias` is always non-negative, and the bias will either remain the same or increase (always by the same amount), but never decrease.
The bias should be able to increase, decrease, or remain stable depending on the characteristics of the data and model. Multiplying by the gradient of the cost function provides exactly this, via the chain rule.
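A small numeric sketch of this point (plain Java, not Smile's `Layer` code; the helper is made up): summing per-unit gradients into a bias delta, the broken 0/1 gradient yields a non-negative delta regardless of the data, while the chain-rule gradient lets the bias move in either direction.

```java
public class BiasSketch {
    /** Bias delta as the sum of per-unit gradients (illustrative only). */
    static double biasDelta(double[] gradient) {
        double sum = 0;
        for (double g : gradient) sum += g;
        return sum;
    }

    public static void main(String[] args) {
        double[] incoming = {-0.75, 0.25};  // dC/dy from the upper layer
        double[] y = {2.0, 1.0};            // ReLU outputs (both positive)

        // Broken: gradient overridden with the local derivative (0 or 1).
        double[] broken = {y[0] > 0 ? 1 : 0, y[1] > 0 ? 1 : 0};
        // Fixed: local derivative multiplied by the incoming gradient.
        double[] fixed = {(y[0] > 0 ? 1 : 0) * incoming[0],
                          (y[1] > 0 ? 1 : 0) * incoming[1]};

        System.out.println(biasDelta(broken)); // 2.0  (can never be negative)
        System.out.println(biasDelta(fixed));  // -0.5 (sign follows the data)
    }
}
```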
Thanks. I fixed it:

```java
g[i] *= y[i] > 0 ? 1 : 0;
```
The MSE of the regression test is generally smaller.
@albertmolinermrf v2.3 is out with the fix.
Expected behaviour
In an MLP, the gradient of the upper layers has to be backpropagated to lower layers for updates to be calculated properly.
Actual behaviour
When using a v2.2.2 rectifier layer, the gradient from the upper layer is always overridden with either 0 or 1. This causes rectifier layers to be unable to minimize the cost function as they should.
I think the problem lies in this line: https://github.com/haifengl/smile/blob/acf15aa1e5366215aa576a47ec4ce0826db8c335/core/src/main/java/smile/base/mlp/ActivationFunction.java#L95 A `*` is missing before the `=`, so that the gradient from the upper layer is also taken into account. I can provide examples of the consequences of the wrong behaviour if needed.
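A minimal standalone illustration of the difference between `=` and `*=` at that line (identifiers are made up, not Smile's source):

```java
public class OverrideVsMultiply {
    // v2.2.2 behaviour at the cited line: the incoming gradient is overwritten
    // with the local ReLU derivative, so the upper layer's signal is lost.
    static double overridden(double incoming, double y) {
        return y > 0 ? 1 : 0;
    }

    // Intended behaviour: the local derivative scales the incoming gradient.
    static double multiplied(double incoming, double y) {
        return incoming * (y > 0 ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(overridden(0.7, 3.0)); // 1.0 -> upper-layer gradient lost
        System.out.println(multiplied(0.7, 3.0)); // 0.7 -> gradient backpropagated
    }
}
```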