Open wangkuiyi opened 6 years ago
Not just sigmoid and tanh -- neither of them is sufficiently good.
We should include ReLU and its variants.
A comparison is here https://datascience.stackexchange.com/questions/14349/difference-of-activation-functions-in-neural-networks-in-general
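To make the proposal concrete, here is a minimal sketch of ReLU next to one of its variants (Leaky ReLU is used purely as an illustrative example; the `alpha` value is a free choice, not something fixed by this issue):

```python
import numpy as np

def relu(x):
    # ReLU: pass positives through, clamp negatives to zero
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # Illustrative variant: negative inputs keep a small slope alpha
    # instead of a hard zero, which avoids fully "dead" units
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(relu(x))        # negatives clamped to 0, positives unchanged
print(leaky_relu(x))  # negatives scaled by alpha, positives unchanged
```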
An additional note is that the maximum slope of the sigmoid is 1/4, while that of tanh is 1, four times larger. A larger gradient is preferred primarily because gradients are multiplied together along the chain rule, so small per-layer slopes compound into vanishing gradients.
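The 1/4 vs. 1 claim follows from the closed-form derivatives, sigmoid'(x) = s(x)(1 - s(x)) and tanh'(x) = 1 - tanh²(x), both maximized at x = 0. A quick numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 10001)            # grid includes x = 0 exactly

sig_slope = sigmoid(xs) * (1.0 - sigmoid(xs))  # d/dx sigmoid
tanh_slope = 1.0 - np.tanh(xs) ** 2            # d/dx tanh

print(sig_slope.max())   # 0.25, attained at x = 0
print(tanh_slope.max())  # 1.0, attained at x = 0
```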
ReLU needs to work with batch norm.
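One reading of why the two pair well: ReLU outputs are non-negative, so the mean activation drifts away from zero, and batch norm re-centers it per feature before the next layer. A minimal sketch of that interaction, using the standard batch-norm formula with gamma = 1 and beta = 0 (these simplifications and the batch shape are my assumptions, not part of this issue):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))      # a batch of 32 pre-activations, 8 features

h = np.maximum(x, 0.0)            # ReLU: all outputs are >= 0

# Batch norm re-centers and re-scales each feature over the batch
# (gamma = 1, beta = 0 here), restoring a zero mean for the next layer.
eps = 1e-5
y = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

print(h.mean())                       # positive: ReLU shifted the mean
print(np.abs(y.mean(axis=0)).max())   # ~0 after batch norm
```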