Open wangkuiyi opened 6 years ago
Not just sigmoid and tanh -- neither of them is sufficiently good.
We should include ReLU and its variants.
A comparison is here https://datascience.stackexchange.com/questions/14349/difference-of-activation-functions-in-neural-networks-in-general
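To make the proposal concrete, here is a minimal sketch of ReLU next to one of its variants (Leaky ReLU is used purely as an illustrative example; the `alpha` value is a free choice, not something fixed by this issue):

```python
import numpy as np

def relu(x):
    # ReLU: pass positives through, clamp negatives to zero
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # Illustrative variant: negative inputs keep a small slope alpha
    # instead of a hard zero, which avoids fully "dead" units
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(relu(x))        # negatives clamped to 0, positives unchanged
print(leaky_relu(x))  # negatives scaled by alpha, positives unchanged
```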
An additional note is that the maximum slope of the sigmoid is 1/4, while that of tanh is 1, four times larger. A larger gradient is preferred primarily because gradients are multiplied together along the chain rule, so small per-layer slopes compound into vanishing gradients.
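The 1/4 vs. 1 claim follows from the closed-form derivatives, sigmoid'(x) = s(x)(1 - s(x)) and tanh'(x) = 1 - tanh²(x), both maximized at x = 0. A quick numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 10001)            # grid includes x = 0 exactly

sig_slope = sigmoid(xs) * (1.0 - sigmoid(xs))  # d/dx sigmoid
tanh_slope = 1.0 - np.tanh(xs) ** 2            # d/dx tanh

print(sig_slope.max())   # 0.25, attained at x = 0
print(tanh_slope.max())  # 1.0, attained at x = 0
```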
ReLU needs to work with batch norm.
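One reading of why the two pair well: ReLU outputs are non-negative, so the mean activation drifts away from zero, and batch norm re-centers it per feature before the next layer. A minimal sketch of that interaction, using the standard batch-norm formula with gamma = 1 and beta = 0 (these simplifications and the batch shape are my assumptions, not part of this issue):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))      # a batch of 32 pre-activations, 8 features

h = np.maximum(x, 0.0)            # ReLU: all outputs are >= 0

# Batch norm re-centers and re-scales each feature over the batch
# (gamma = 1, beta = 0 here), restoring a zero mean for the next layer.
eps = 1e-5
y = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

print(h.mean())                       # positive: ReLU shifted the mean
print(np.abs(y.mean(axis=0)).max())   # ~0 after batch norm
```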