haofanwang / Score-CAM

Official implementation of Score-CAM in PyTorch
MIT License

Normalization Operation #26

Closed cyy280113999 closed 2 years ago

cyy280113999 commented 2 years ago

Hi haofanwang, author of Score-CAM.

I'm a student studying CNN interpretation, and something about Score-CAM confuses me. I copy a paragraph below from the paper "Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks".

3.2. Normalization on Score Each forward passing in neural network is independent, the score amplitude of each forward propagation is unpredictable and not fixed. The relative output value (post-softmax) after normalization is more reasonable to measure the relevance than absolute output value (pre-softmax). Thus, in Score-CAM, we represent weight as post-softmax value, so that the score can be rescaled into a fixed range. ... Normalization operation equips Score-CAM with good class discrimination ability.

What exactly is the normalization operation? After reading, I have two ideas:

  1. Normalization on the logits. VGG16 in PyTorch outputs logits, which can include negative elements. The probabilities are the logits passed through a softmax, and the probability of class c would be the normalized score. This reading amounts to replacing the score function: without normalization the score function returns a logit, with normalization it returns a probability.
  2. Normalization on the scores. The scores (CIC) of every channel are stored in a tensor and act as weights. As written in Algorithm 1, the scores are passed through a softmax in place, so that they sum to one. (A minimal sketch of both readings follows this list.)
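
Here is that sketch (purely illustrative; the shapes and the names `logits` and `channel_scores` are my own, not from the repo):

    import torch
    import torch.nn.functional as F

    c = 243                            # hypothetical target class index
    logits = torch.randn(1, 1000)      # model output for one masked input
    channel_scores = torch.randn(512)  # CIC score of each activation channel

    # Idea 1: softmax over the logits, then take class c (a probability in [0, 1])
    score_idea1 = F.softmax(logits, dim=1)[0, c]

    # Idea 2: keep raw logits as scores, then softmax over the channel scores
    weights_idea2 = F.softmax(channel_scores, dim=0)  # weights sum to one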

Are both operations applied? Which of them improves the discrimination power?

BTW, when both are applied, the result is worse.

haofanwang commented 2 years ago

We apply normalization on the logits, and we call the post-softmax results scores, which are in the range [0, 1].

Could you clarify the difference between them or post your code?

cyy280113999 commented 2 years ago

Sure.

    # Excerpt from my modified Score-CAM forward pass. It assumes that torch and
    # torch.nn.functional (as F) are imported, and that self.model_arch (the CNN),
    # self.activations (the hooked feature maps), binarize / double_way_normalize
    # (min-max style normalizers), and the flags sg / norm / relu are defined elsewhere.
    b, c, h, w = input.size()
    assert b == 1
    input = input.cuda()

    # prediction on the raw input
    logit = self.model_arch(input)
    if class_idx is None:
        predicted_class = logit.max(1)[-1]
    else:
        predicted_class = torch.LongTensor([class_idx])

    if not sg:
        # original version: the score is the raw logit of the target class
        net_fun = lambda x: self.model_arch(x)[:, predicted_class]
    else:
        # new 'sg' version: the score is the post-softmax probability of the target class
        net_fun = lambda x: F.softmax(self.model_arch(x), 1)[:, predicted_class]

    activations = self.activations
    b, k, u, v = activations.size()

    scores = torch.zeros((k,)).cuda()

    # batch the masked forward passes; configure for your GPU
    parallel_batch = 64
    for i in range(0, k, parallel_batch):
        j = min(i + parallel_batch, k)  # the last chunk may be smaller
        # upsample the activation maps to the input resolution
        mask = activations[0, i:j, :, :].reshape(j - i, 1, u, v)
        mask = F.interpolate(mask, size=(h, w), mode='bilinear', align_corners=False)
        # normalize each mask to [0, 1]
        mask = binarize(mask)
        # score of the target class for each masked input
        scores[i:j] = net_fun(input * mask).reshape(j - i)

    if norm:
        # 'norm' version: softmax over the channel scores so the weights sum to one
        scores = F.softmax(scores, 0)

    score_saliency_map = (scores.reshape(1, k, 1, 1) * activations).sum(1, keepdim=True)
    score_saliency_map = F.interpolate(score_saliency_map, size=(h, w), mode='bilinear', align_corners=False)
    if relu:
        score_saliency_map = F.relu(score_saliency_map)
        score_saliency_map = binarize(score_saliency_map)
    else:
        score_saliency_map = double_way_normalize(score_saliency_map)

As you can see, 'sg' changes the scalar-valued network function $\phi(x)$ (raw logit vs. post-softmax probability), and 'norm' decides whether a softmax is applied over the channel scores.

Here are examples for the dog class in the image.

[Images: saliency maps for cat_dog_243_282 (class 282): no sg / no norm (logits as weights); with norm; with sg; with sg + norm]

haofanwang commented 2 years ago

(1) What is 'sg'? If I understand correctly, it should be identical to the 'without sg' version followed by a softmax. It's unclear to me why you add such an operation to the procedure.

(2) It makes no sense to apply both 'sg' and 'norm'; in that case the softmax would be applied twice. Below is an example. Applying softmax twice brings the scores of the different categories closer together.

    import torch
    import torch.nn as nn

    input = torch.randn(1, 3)
    softmax = nn.Softmax(dim=1)

    # the assertion fails only if all values in input are equal
    # (compare with allclose, since a multi-element boolean tensor cannot be used in assert)
    assert not torch.allclose(softmax(input), softmax(softmax(input)))

Please let me know if you have further questions; I'm delighted to discuss.

cyy280113999 commented 2 years ago

Yes, it's weird. Softmax is meant to take inputs from all of $\mathbb{R}$; applying it to probabilities between 0 and 1 makes the scores lose even more discrimination. I only did it to understand the process described in the paper.

I'll keep this issue for the "Normalization" topic. It is closed now.

amirro1 commented 6 months ago

Hi,

I know this issue was closed, but I'm still a little confused about the two applications of the softmax function. To me, it seems like there are two (separate) places that the softmax could be applied:

  1. We could run a softmax on the model's outputs (I think you guys called this 'sg' above).
  2. We could run a softmax on the weights we get for each activation map (I think you guys called this 'norm' above).

My understanding is that these two applications perform different functions. In 1, we are normalizing the model's output given a particular perturbed input, and in 2, we are normalizing the weights for each activation map. However, above, you say that it makes no sense to combine sg and norm. Could you elaborate on this a bit further? Are sg and norm somehow actually performing the same function?

cyy280113999 commented 6 months ago

There are two things I need to explain.

  1. The first case shows that applying either of them works similarly.

Let's assume we have only two masks to run. The model has only two classes in its logits, and we focus on class one.

1.1 Apply softmax on the logits

First, look at mask one. We feed the masked input into the model and get the logits back, like this:

$$[z_1,z_2]$$

We need the score of class one, but first we apply softmax to the logits and get this:

$$[\frac{e^{z_1}}{e^{z_1}+e^{z_2}}, \frac{e^{z_2}}{e^{z_1}+e^{z_2}}]$$

Then we select the first class:

$$\frac{e^{z_1}}{e^{z_1}+e^{z_2}}$$

In the same way, we get the score of mask two:

$$[z_3,z_4]$$

$$[\frac{e^{z_3}}{e^{z_3}+e^{z_4}}, \frac{e^{z_4}}{e^{z_3}+e^{z_4}}]$$

$$\frac{e^{z_3}}{e^{z_3}+e^{z_4}}$$

We put them together as the scores of the two masks:

$$[\frac{e^{z_1}}{e^{z_1}+e^{z_2}}, \frac{e^{z_3}}{e^{z_3}+e^{z_4}}]$$

1.2 Apply softmax on the scores

Here, we first collect the raw logits from the two masks:

$$[z_1,z_2],\ [z_3,z_4]$$

take the class-one logit from each:

$$[z_1,z_3]$$

and finally apply softmax across the masks:

$$[\frac{e^{z_1}}{e^{z_1}+e^{z_3}}, \frac{e^{z_3}}{e^{z_1}+e^{z_3}}]$$

They look very similar.
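
For a concrete numeric check of 1.1 vs 1.2, here is a small sketch of the two-mask, two-class example above (the logit values are made up):

    import torch
    import torch.nn.functional as F

    # logits [class one, class two] under mask one and mask two (made-up values)
    z_mask1 = torch.tensor([2.0, 0.5])
    z_mask2 = torch.tensor([1.0, 1.5])

    # 1.1: softmax within each mask's logits, then take class one
    scores_11 = torch.stack([F.softmax(z_mask1, 0)[0], F.softmax(z_mask2, 0)[0]])

    # 1.2: take class one's raw logit from each mask, then softmax across masks
    scores_12 = F.softmax(torch.stack([z_mask1[0], z_mask2[0]]), 0)

    print(scores_11)  # approx. tensor([0.8176, 0.3775])
    print(scores_12)  # approx. tensor([0.7311, 0.2689])

In both cases mask one gets the larger score, so the relative ranking of the masks is the same.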

  2. The second case shows that applying both of them loses class discrimination.

Suppose we have already applied softmax and obtained $[p_1,p_3]$.

Then we apply softmax on them again and get $[pp_1,pp_3]$.

As you know, the input of softmax is supposed to range from negative infinity to positive infinity.

If $p_1 \ge p_3$, then $pp_1 \ge pp_3$, so the maximum is kept. But

$$|pp_1-pp_3|\le |p_1-p_3|$$

The probabilities get closer to 0.5, so the difference becomes smaller.
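
A quick numeric check of this, with made-up probabilities:

    import torch
    import torch.nn.functional as F

    p = torch.tensor([0.9, 0.1])   # probabilities after one softmax
    pp = F.softmax(p, 0)           # apply softmax a second time
    print(pp)                      # approx. tensor([0.6900, 0.3100])
    print((p[0] - p[1]).abs().item())    # approx. 0.8
    print((pp[0] - pp[1]).abs().item())  # approx. 0.38, the gap shrinks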

amirro1 commented 6 months ago

Thank you so much for your quick response.

So would the following be a correct interpretation of what you said above:

Applying the softmax to the model's output (from a single perturbed image) is similar to (but not exactly the same as) applying the softmax to the activation weights (i.e. the logits for the class of interest across all perturbed inputs). However, it is not appropriate to apply the softmax in both places, because this would "minimize the difference" between distinct values.