keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

The gradients of the output of a softmax layer w.r.t. a certain former layer are all zeros. #5881

Closed: zehzhang closed this issue 7 years ago

zehzhang commented 7 years ago

I am trying to implement Grad-CAM and need to compute the gradients of the output of the last softmax layer w.r.t. an earlier layer.

This is my model:

```python
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))
model = Sequential()
model.add(base_model)
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(24, activation='softmax'))
```

If I compute the gradients like this:

```python
grads = K.gradients(model.layers[-1].output[0, 0], model.layers[-5].layers[-2].output)[0]
```

I get an array of all zeros.

However, if I compute the gradients of the output of the second-to-last layer w.r.t. the same earlier layer:

```python
grads = K.gradients(model.layers[-2].output[0, 0], model.layers[-5].layers[-2].output)[0]
```

I get reasonable gradients.

So how can I solve this?

AvantiShri commented 7 years ago

Have you considered that this is simply happening because the predictions are confident and thus the gradient of the softmax has saturated to be near zero?
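
For intuition, here is a minimal NumPy sketch (not from the thread itself) of why that happens: the Jacobian of a softmax w.r.t. its logits is diag(p) - p p^T, which shrinks toward zero as one probability approaches 1. The logit values below are made up purely for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax_i / d z_j = p_i * (delta_ij - p_j) = diag(p) - p p^T
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

# Low-confidence logits: Jacobian entries are on the order of 0.1.
print(softmax_jacobian(np.array([0.1, 0.2, 0.0])))

# High-confidence logits: every entry collapses toward zero, so any gradient
# flowing back through the softmax is crushed as well.
print(softmax_jacobian(np.array([20.0, 0.0, 0.0])))
```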

zehzhang commented 7 years ago

@AvantiShri Yes, you are correct! I think it is because the network is very confident, so the gradients are effectively zero. I am now trying to apply a different activation in the last layer so that I can get non-zero gradients. Thanks!

AvantiShri commented 7 years ago

Another thing you can do is separate the dense layer into model.add(Dense(24)) and model.add(Activation("softmax")), and then take the gradients w.r.t. the input to the softmax (layers[-2]). If you want, you can also normalize the gradient for each class in layers[-2] by subtracting the mean gradient across all classes.
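
A minimal sketch of this suggestion, assuming the VGG16-based model from the first comment (the 224x224 input size is an assumption, and because the extra Activation layer shifts the Sequential indices, the VGG16 base is referenced directly rather than by index):

```python
from keras import backend as K
from keras.applications.vgg16 import VGG16
from keras.layers import Activation, Dense, Flatten
from keras.models import Sequential

img_width, img_height = 224, 224  # assumed input size

base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(img_width, img_height, 3))
model = Sequential()
model.add(base_model)
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(24))                # pre-softmax class scores
model.add(Activation('softmax'))    # probabilities, still available for prediction

logits = model.layers[-2].output            # output of Dense(24), before the softmax
conv_output = base_model.layers[-2].output  # the earlier conv layer from the question
grads = K.gradients(logits[0, 0], conv_output)[0]  # does not saturate like the softmax output
```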

zehzhang commented 7 years ago

@AvantiShri Yes, I had thought of that approach as well. Although I am now using relu in the last layer and it works well, I think your way is more elegant because I can correctly compute the gradients and still get the label prediction. I am not very clear about the normalization part, though. Could you please say more about the intent of normalizing in that way?

AvantiShri commented 7 years ago

Imagine your dense layer has two classes, c1 and c2, and the output is softmax(c1, c2). Now consider some neuron x, which has a gradient of 10 w.r.t. c1 and 10 w.r.t. c2, and some other neuron y, which has a gradient of 10 w.r.t. c1 and 0 w.r.t. c2. Because the softmax operation normalizes by all classes in the denominator, the net effect of y on the softmax output for c1 is greater than the net effect of x on the softmax output for c1 (x affects c2 just as much as it affects c1, so its net effect on c1 cancels out). One way to make sure you give more importance to y than to x in this situation is to normalize by the mean gradient w.r.t. all classes: for y, the mean gradient over all classes is 5, so after normalization y has an effect of 5 on c1 and -5 on c2; for x, the mean gradient over all classes is 10, so after normalization x has an effect of 0 on c1 and 0 on c2.
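
A tiny NumPy sketch of that normalization, reusing the numbers from the example above (rows are the neurons x and y, columns are the classes c1 and c2):

```python
import numpy as np

# Gradients of each class score w.r.t. the two neurons in the example:
# rows = neurons (x, y), columns = classes (c1, c2).
grads = np.array([[10.0, 10.0],   # neuron x
                  [10.0,  0.0]])  # neuron y

# Subtract each neuron's mean gradient across classes.
normalized = grads - grads.mean(axis=1, keepdims=True)

print(normalized)
# [[ 0.  0.]   -> x: net effect of 0 on both classes
#  [ 5. -5.]]  -> y: +5 on c1, -5 on c2
```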

AvantiShri commented 7 years ago

A similar idea to the one described above is mentioned in the deeplift preprint; see section 2.5, titled "A note on Softmax activations" (there, it is the softmax weights that are mean-normalized, which is a less thorough solution than the one I described in my previous comment, but in the same spirit: https://arxiv.org/pdf/1605.01713.pdf). For your information, an updated version of deeplift will be released soon which uses the normalization approach described in the previous comment (and which will also be easier to follow; the current preprint is pretty hard to understand).
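
As a rough illustration of the weight-normalization variant mentioned in that section (my reading of it, so treat the details as an assumption rather than the preprint's exact procedure): the weights feeding into the softmax are shifted to be zero-mean across classes, which leaves the softmax predictions unchanged, because the softmax is invariant to adding the same constant to every logit.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical softmax-layer parameters: weights of shape (num_inputs, num_classes) and biases.
W = np.random.randn(8, 3)
b = np.random.randn(3)

# Mean-normalize across classes.
W_norm = W - W.mean(axis=1, keepdims=True)
b_norm = b - b.mean()

a = np.random.randn(8)  # some activations feeding into the softmax layer
print(np.allclose(softmax(a @ W + b), softmax(a @ W_norm + b_norm)))  # True
```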

zehzhang commented 7 years ago

@AvantiShri I got you! Many thanks for the detailed explanation. Just one more question: do you know whether there is a convenient way to compute the mean gradient? As far as I can tell, I would need to build as many gradient functions as there are classes and then compute the mean over all the gradients. That is not difficult, but it is really cumbersome.

AvantiShri commented 7 years ago

How about:

```python
K.gradients(K.mean(model.layers[-2].output[0, :], axis=-1), model.layers[-5].layers[-2].output)[0]
```

This works because gradients are linear, i.e. the gradient of the sum of two variables is the sum of the gradients of each.

(Also, I notice you compute the gradients w.r.t. model.layers[-2].output[0, 0]. I figure you are doing this because the first argument to K.gradients needs to be a scalar, but consider computing the gradients w.r.t. K.sum(model.layers[-2].output[:, 0], axis=0) instead; that way, you get the gradients for every sample in the batch, not just the first one. Similarly, the mean-gradient computation would be w.r.t. K.sum(K.mean(model.layers[-2].output[:, :], axis=-1), axis=0). The reason you can sum across the batch axis is that each sample in the batch is totally independent.)
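
Putting those two suggestions together, a sketch under the assumption that the split-softmax model sketched earlier is used (so model.layers[-2] is the Dense(24) logits layer and base_model is the VGG16 sub-model):

```python
from keras import backend as K

logits = model.layers[-2].output            # shape (batch_size, 24), pre-softmax
conv_output = base_model.layers[-2].output  # the earlier conv feature map

# Mean over classes, then sum over the batch: samples are independent, so the
# summed objective just stacks each sample's own gradient into the result.
target = K.sum(K.mean(logits, axis=-1), axis=0)
grads = K.gradients(target, conv_output)[0]

get_grads = K.function([model.input, K.learning_phase()], [grads])
# grad_values = get_grads([x_batch, 0])[0]  # x_batch: a batch of preprocessed images
```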

zehzhang commented 7 years ago

@AvantiShri Thanks for all the help! In fact, I have also been doing some work on network visualization recently. I look forward to seeing the new version of the paper posted above and hope I can get some inspiration from it! 😃

vinayakumarr commented 7 years ago

My network is given below.

1. Define the network:

```python
model = Sequential()
model.add(LSTM(4, input_dim=42))  # try using a GRU instead, for fun
model.add(Dropout(0.1))
model.add(Dense(5))
model.add(Activation('softmax'))
```

Taylor expansion: compute the first partial derivative of the classification results:

```python
from keras import backend as K
import theano

def compile_saliency_function(model):
    """
    Compiles a function to compute the saliency maps and predicted classes
    for a given minibatch of input images.
    """
    inp = model.layers[0].input
    print("-----------------------input-----------------------------")
    print(inp)
    outp = model.layers[-1].output
    print("-----------------------output----------------------------")
    print(outp)
    max_outp = K.T.max(outp, axis=1)
    print(max_outp)
    saliency = K.gradients(K.sum(max_outp), inp)
    print(saliency)
    max_class = K.T.argmax(outp, axis=1)
    print(max_class)
    v1 = K.function([inp, K.learning_phase()], [saliency, max_class])
    return v1

print(([X_train[:20], 0])[0])
v = compile_saliency_function(model)([X_train[:5], 0])[0]
print(v)
```

The saliency map gives:

```
[[[ -8.015e-29  6.428e-28 -1.365e-29  3.198e-30 -1.229e-29 -2.009e-30 -1.273e-30 -2.512e-29 -6.153e-28  3.688e-29  2.205e-28  4.094e-28  5.667e-28 -1.401e-27  1.892e-28  5.810e-30  1.736e-29  7.379e-29 -2.122e-28  8.063e-29  3.660e-31  1.458e-31  9.196e-28 -1.335e-30  1.608e-30 -8.264e-29 -1.094e-28  1.099e-29  6.411e-30  1.134e-28  3.281e-29  3.869e-29 -1.276e-30 -9.509e-31  8.313e-29  4.073e-29 -8.345e-30  3.455e-29 -1.100e-28 -8.354e-29 -5.175e-29  1.583e-29]]

 [[ -8.015e-29  6.428e-28 -1.365e-29  3.198e-30 -1.229e-29 -2.009e-30 -1.273e-30 -2.512e-29 -6.153e-28  3.688e-29  2.205e-28  4.094e-28  5.667e-28 -1.401e-27  1.892e-28  5.810e-30  1.736e-29  7.379e-29 -2.122e-28  8.063e-29  3.660e-31  1.458e-31  9.196e-28 -1.335e-30  1.608e-30 -8.264e-29 -1.094e-28  1.099e-29  6.411e-30  1.134e-28  3.281e-29  3.869e-29 -1.276e-30 -9.509e-31  8.313e-29  4.073e-29 -8.345e-30  3.455e-29 -1.100e-28 -8.354e-29 -5.175e-29  1.583e-29]]

 [[ -8.015e-29  6.428e-28 -1.365e-29  3.198e-30 -1.229e-29 -2.009e-30 -1.273e-30 -2.512e-29 -6.153e-28  3.688e-29  2.205e-28  4.094e-28  5.667e-28 -1.401e-27  1.892e-28  5.810e-30  1.736e-29  7.379e-29 -2.122e-28  8.063e-29  3.660e-31  1.458e-31  9.196e-28 -1.335e-30  1.608e-30 -8.264e-29 -1.094e-28  1.099e-29  6.411e-30  1.134e-28  3.281e-29  3.869e-29 -1.276e-30 -9.509e-31  8.313e-29  4.073e-29 -8.345e-30  3.455e-29 -1.100e-28 -8.354e-29 -5.175e-29  1.583e-29]]

 [[ -1.190e-28  7.039e-28 -3.475e-30  3.815e-30 -1.410e-29 -2.210e-30 -1.390e-30 -3.224e-29 -7.007e-28  3.122e-29  2.429e-28  4.564e-28  6.228e-28 -1.537e-27  1.703e-28  1.865e-29  1.983e-29  7.343e-29 -2.412e-28  7.795e-29  3.947e-31  1.972e-31  9.979e-28 -1.683e-30  1.429e-30 -1.043e-28 -1.336e-28  1.509e-29  6.298e-30  1.371e-28  4.295e-29  7.593e-29 -1.360e-30 -5.704e-31  9.312e-29  8.257e-29 -3.896e-30  4.088e-29 -1.375e-28 -1.039e-28 -6.191e-29  1.965e-29]]

 [[ -1.188e-28  7.032e-28 -3.503e-30  3.810e-30 -1.408e-29 -2.208e-30 -1.388e-30 -3.220e-29 -6.999e-28  3.121e-29  2.427e-28  4.559e-28  6.221e-28 -1.535e-27  1.702e-28  1.860e-29  1.980e-29  7.337e-29 -2.409e-28  7.790e-29  3.943e-31  1.969e-31  9.969e-28 -1.681e-30  1.429e-30 -1.042e-28 -1.334e-28  1.507e-29  6.294e-30  1.369e-28  4.289e-29  7.576e-29 -1.359e-30 -5.711e-31  9.302e-29  8.237e-29 -3.906e-30  4.083e-29 -1.373e-28 -1.038e-28 -6.183e-29  1.962e-29]]]
```

Visualizing the average activation values of the input features for 5 samples of class 0:

```python
def get_activations(model, layer, X_batch):
    get_activations = K.function([model.layers[0].input, K.learning_phase()],
                                 model.layers[layer].output)
    activations = get_activations([X_batch, 0])
    return activations

my_featuremaps = get_activations(model, 0, ([X_train[:5], 0])[0])
print(my_featuremaps)
np.savetxt('featuremap', my_featuremaps)
```

The activation values are:

```
[[ 0.762 -0.122  0.044 -0.758]
 [ 0.762 -0.122  0.044 -0.758]
 [ 0.762 -0.122  0.044 -0.758]
 [ 0.762 -0.038  0.043 -0.752]
 [ 0.762 -0.038  0.043 -0.752]
 [ 0.762 -0.038  0.043 -0.752]
 [ 0.125 -0.     0.038 -0.083]
 [ 0.762 -0.121  0.044 -0.758]
 [ 0.762 -0.038  0.043 -0.752]
 [ 0.762 -0.433  0.091 -0.755]]
```

Could you please tell me whether the method above is correct? If so, could you also tell me how to generate the plots in Fig. 3 and Fig. 6 of the paper titled "Empowering Convolutional Networks for Malware Classification and Analysis"?

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.