rasmusbergpalm / DeepLearnToolbox

Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started.
BSD 2-Clause "Simplified" License

softmax derivative in nnbp.m #113

Closed · Moadab-AI closed this issue 10 years ago

Moadab-AI commented 10 years ago

There seems to be a bug in the backpropagation algorithm in the NN folder. The bug is in the calculation of the deltas, specifically in the derivative of the activation function at the output layer (lines 7-12):

switch nn.output
    case 'sigm'
        d{n} = - nn.e .* (nn.a{n} .* (1 - nn.a{n}));
    case {'softmax','linear'}
        d{n} = - nn.e;
end

Shouldn't softmax be bundled with 'sigm' rather than 'linear', since the derivative of softmax is identical to the sigmoid's, i.e. a_n (1 - a_n)? Or am I missing something?

LNK123765 commented 10 years ago

I don't believe the derivative of the sigmoid is equivalent to the derivative of the softmax. Have you performed a gradient check using softmax output?
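Such a check is quick to set up. A minimal, standalone Octave sketch of a finite-difference check for the cross-entropy + softmax pairing might look like this (illustrative only, not toolbox code):

    % Finite-difference gradient check for a softmax output layer trained
    % with cross-entropy loss. Standalone illustration, not toolbox code.
    z = randn(5, 1);                         % output-layer pre-activations
    y = [0; 0; 1; 0; 0];                     % one-hot target
    softmax = @(v) exp(v - max(v)) / sum(exp(v - max(v)));
    loss    = @(v) -sum(y .* log(softmax(v)));

    a = softmax(z);
    analytic = a - y;                        % candidate delta dE/dz

    numeric = zeros(size(z));
    h = 1e-6;
    for k = 1:numel(z)
        e_k = zeros(size(z)); e_k(k) = h;
        numeric(k) = (loss(z + e_k) - loss(z - e_k)) / (2 * h);
    end

    max(abs(analytic - numeric))             % ~1e-10 if the analytic delta is right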

Moadab-AI commented 10 years ago

Well, I did the fairly trivial hand calculation and ended up with exactly the sigmoid's result, a_n (1 - a_n):

[image: photo of a handwritten derivation of the softmax derivative]
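(The photo is no longer available; the diagonal-term calculation it presumably shows is the standard one, reconstructed here in LaTeX rather than transcribed from the image:)

    a_i = \frac{e^{z_i}}{\sum_j e^{z_j}},
    \qquad
    \frac{\partial a_i}{\partial z_i}
      = \frac{e^{z_i} \sum_j e^{z_j} - e^{z_i} e^{z_i}}{\left( \sum_j e^{z_j} \right)^2}
      = a_i (1 - a_i)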

Also, when I tried to look it up for confirmation, I found the same answer, e.g. in http://www.cs.bham.ac.uk/~jxb/INC/l7.pdf, page 13.

But above all, even if all of this is wrong, I cannot see how the derivative could be 1, as it effectively is in the code. Can it?

And no, I didn't do the gradient check, since the math didn't seem right to me in the first place.

pakozm commented 10 years ago

I'm not a user of DeepLearnToolbox, but I'm interested ;-) The derivatives of the cross-entropy loss and the softmax activation are coupled: you get better numerical stability if you compute the delta at the output layer taking both functions into account. For this reason, several NN tools implement the cross-entropy loss derivative as the combined cross-entropy + softmax derivative, which reduces the delta computation at the softmax layer to a linear expression. You can read more in the following paper:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.6403
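Spelled out, the cancellation is the standard cross-entropy + softmax derivation (written here in LaTeX; it assumes the toolbox's error term is nn.e = y - a, as nnff.m computes it):

    E = -\sum_j y_j \log a_j,
    \qquad
    a_j = \frac{e^{z_j}}{\sum_m e^{z_m}},
    \qquad
    \frac{\partial a_j}{\partial z_k} = a_j (\delta_{jk} - a_k)

    \frac{\partial E}{\partial z_k}
      = -\sum_j \frac{y_j}{a_j} \, a_j (\delta_{jk} - a_k)
      = -y_k + a_k \sum_j y_j
      = a_k - y_k

(the last step uses \sum_j y_j = 1 for one-hot targets). So the delta is a - y = -nn.e: the softmax Jacobian is absorbed into the loss derivative, which is why the 'softmax' case shares the 'linear' branch in nnbp.m.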

LNK123765 commented 10 years ago

Thank you for the derivation, mabdollahi. I see now that it is quite easy.

Thank you for the link, pakozm, as this has been confusing me.

Moadab-AI commented 10 years ago

Oh, thank you very much, pakozm! You're a lifesaver. The article you referred to answered my question precisely; based on it, the code is fine as long as the CE cost function is paired with softmax activations.

pakozm commented 10 years ago

You're welcome :-) I was also a bit confused about this a while ago ;-)

rasmusbergpalm commented 10 years ago

Thanks a lot @pakozm :)