utkuozbulak / pytorch-cnn-visualizations

Pytorch implementation of convolutional neural network visualization techniques
MIT License

Gradient values and gradient relevance #44

Closed javiercoroneltum closed 5 years ago

javiercoroneltum commented 5 years ago

Hi, I've been trying the code for a while and have a couple of questions about the gradients. To give some context: I would like to identify which feature maps are the most relevant for a specific class, i.e. not use all the feature maps for the visualization, but only the most important ones.

So the paper says that after doing the back-propagation for a specific class, we average the gradients, and that average captures the "importance" of a feature map (a rough sketch of how these averaged gradients can be captured is shown below, after the first question). I've been exploring their distribution for each layer of AlexNet, and here is the distribution of those averaged gradients for one specific layer: [image]

So we have a distribution with both positive and negative gradients that are close to zero. As the code is implemented, we use all of those gradients, and that results in a visualization that looks like this for the ImageNet class gondola: [image]

At first I thought: OK, I won't use all the gradients, I only want the gradients closest to zero, so I set a window to include only those. And here comes my first question:

  1. Are the gradients closest to zero (or exactly zero) the ones that would require the smallest update to the kernels during training, meaning that those feature maps are already the most relevant and meaningful for the final classification?
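
For reference, a minimal sketch of how such per-feature-map averaged gradients can be captured with forward/backward hooks, assuming torchvision's AlexNet and a recent PyTorch; the layer index and the ImageNet class index for gondola used here are assumptions, not necessarily the exact ones behind the plots above:

```python
import torch
import torchvision.models as models

# Minimal sketch: capture activations and gradients of one AlexNet conv layer
# with hooks, then average each feature map's gradient over its spatial dims.
model = models.alexnet(pretrained=True).eval()
target_layer = model.features[10]        # last conv layer of AlexNet (assumed index)

store = {}
target_layer.register_forward_hook(lambda m, i, o: store.update(acts=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grads=go[0].detach()))

img = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed input image
logits = model(img)
target_class = 576                       # assumed ImageNet index for "gondola"
logits[0, target_class].backward()

# One averaged gradient per feature map; this is the distribution plotted above.
avg_grads = store['grads'].mean(dim=(2, 3)).squeeze(0)   # shape: (256,)
print(avg_grads.min().item(), avg_grads.max().item())
```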

After trying that I didn't get an improved visualization, so I kept exploring the gradients. I then decided to use only the positive or only the negative gradients; here are the results. Only positive gradients: [image] Only negative gradients: [image]

It seems to me that the negative gradients are the most meaningful, even more so than using all the gradients, and the same happens for other images as well (a rough sketch of this weight filtering is shown after the questions below). Here are my next questions:

  2. Why would the negative gradients be more relevant for the visualization?
  3. What's going on with only the positive values, and why do they lead to a completely different visualization than in the other cases?
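
A minimal sketch of the weighting variants discussed above (all gradients, positive-only, negative-only) plus the near-zero window, reusing the hypothetical `store` and `avg_grads` from the previous snippet; the repo's Grad-CAM code also applies a ReLU and normalization, which are omitted here, and the window threshold is an arbitrary placeholder:

```python
import torch

# Continuing the sketch above: store['acts'] holds the feature maps and
# avg_grads the per-map averaged gradients.
fmaps = store['acts'].squeeze(0)                           # (256, H, W)

def weighted_map(weights):
    return (weights.view(-1, 1, 1) * fmaps).sum(dim=0)     # weighted sum over maps

cam_all = weighted_map(avg_grads)                               # all gradients
cam_pos = weighted_map(torch.clamp(avg_grads, min=0.0))         # positive only
cam_neg = weighted_map(torch.clamp(avg_grads, max=0.0))         # negative only

window = 1e-4                                                   # arbitrary threshold
cam_zero = weighted_map(avg_grads * (avg_grads.abs() < window)) # near-zero window
```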

Thanks in advance for any answers; I'm looking forward to the discussion.

utkuozbulak commented 5 years ago

Hello,

Let me answer your questions to the best of my knowledge.

Are the gradients closest to zero (or exactly zero) the ones that would require the smallest update to the kernels during training, meaning that those feature maps are already the most relevant and meaningful for the final classification?

I don't know, and I don't think anybody knows (for sure). It is hard to make such a statement without a formal proof.

Why would the negative gradients be more relevant for the visualization?

It may be like that for one particular case. You can't make a general statement like "negative gradients are more useful" based on one (or a couple of) samples.

What's going on with only the positive values, and why do they lead to a completely different visualization than in the other cases?

Again, I don't have a concrete answer. It is hypothesized that gradients highlight the important features. There are some studies suggesting that positive gradients are more important than negative ones for the prediction (guided backpropagation, for example), but there is no clear answer.
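
For concreteness, here is a minimal sketch of the guided-backprop idea mentioned above, i.e. clamping gradients to be non-negative at every ReLU during the backward pass; it assumes torchvision's AlexNet and PyTorch >= 1.8 for register_full_backward_hook, and is a simplified illustration rather than the repository's own GuidedBackprop implementation:

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.alexnet(pretrained=True).eval()

def guided_relu_hook(module, grad_input, grad_output):
    # Let only positive gradients flow back through the ReLU.
    return (torch.clamp(grad_input[0], min=0.0),)

for module in model.modules():
    if isinstance(module, nn.ReLU):
        module.inplace = False          # avoid in-place ops so the hooks behave predictably
        module.register_full_backward_hook(guided_relu_hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in input
logits = model(img)
logits[0, logits[0].argmax()].backward()
guided_grads = img.grad                                  # saliency with negative gradients suppressed
```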

The problem with the interpretability of neural networks is that there is no mathematical definition of 'interpretability'. You can create your own technique and say "oh look, it highlights the parts I wanted it to highlight", but that is a self-fulfilling prophecy because the problem of interpretability is ill-defined.

One way to define it is to use saliency maps as weakly-supervised localization techniques, as in the Grad-CAM paper. That makes sense, but as far as I know, there have been a couple of other studies showing that the interpretability techniques proposed so far are unreliable.

If I were you, I would refrain from making statements based on what you see in a couple of images, because we (humans) are biased towards what we want to see. Plus, we make the assumption that the model is learning the object of interest that we think is important in the image. It might be that the model is not learning what we think it is learning, but instead bases its prediction on something else (while still getting the correct prediction).

javiercoroneltum commented 5 years ago

Hi Utku,

Thank you for your comments, I really appreciate them. A couple of comments from my side too:

It's true that our visual interpretations cannot give a precise definition of the interpretability of a CNN; however, that is exactly the point of using methods such as Grad-CAM to try to understand the black box, at least to some extent.

My idea about the gradients close to zero is as follows. For the weight update of the kernels that produce the feature maps we have the standard gradient-descent rule, `w_new = w_old - η * ∂L/∂w`. We would like the gradients to be zero, so that the update to the weight is small, indicating that the kernel already gives a good output for the classification. Is this line of thinking valid?
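
A minimal sketch of that selection idea, reusing the hypothetical `avg_grads` from the first snippet above; the number of feature maps kept is an arbitrary placeholder:

```python
import torch

# Rank feature maps by how close their averaged gradient is to zero.
# Whether this ranking reflects "relevance" is exactly the open question here.
k = 10                                               # arbitrary number of maps to keep
closest_to_zero = torch.argsort(avg_grads.abs())[:k]
print(closest_to_zero.tolist())                      # indices of the selected feature maps
```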

And as you said:

It may be like that for one particular case. You can't make a general statement like "negative gradients are more useful" based on one (or a couple of) samples.

I tried the same comparison for the use case that I'm working on, and there the negative gradients do not highlight the regions of the image that are important (at least according to human interpretation). So yes, this could be very case-dependent.

My first thought was to somehow truncate those gradient values to identify which specific feature maps, among the whole vector of feature maps, are the most relevant for the classification. The results for my use case were not as good as expected, which is what raised the questions above about the values of the gradients.

So I would like to ask for some input from you: do you have any suggestions on how to identify the feature maps that are most relevant for the classification, even if those feature maps don't highlight the regions a human would expect?

Thanks again!

utkuozbulak commented 5 years ago

Hello again,

Sorry for the late reply.

The questions you ask are a major area of research in the field of deep learning. Like I said before, I am not sure whether your idea would work or not, but it is certainly worth a shot, since most of the interpretability work tried so far is gimmicky gradient stuff on the first layer, with the exception of the CAM-based approaches.

If you ask my opinion on the whole 'interpretability' topic: I find all of it severely lacking because of the assumption that models 'learn' what we think is relevant in the picture. I believe that, more often than not, models do not learn what people think they learn, and what they actually 'learn' is hard to quantify.

What I would suggest is that you keep your novel ideas to yourself and do not reveal them publicly before doing the experiments and writing them down, as ideas can easily be poached (which is not unheard of in the field of computer science).

Best of luck!